Loan Repayment Challenge

by Lau Wen Jun

1. Introduction

In the financial landscape, the ability to accurately predict loan risk and quality stands as a pivotal challenge for lending institutions. The intricate dynamics of lending, coupled with an ever-evolving economic environment, underscore the necessity for robust predictive analytics in loan management. This document delineates an analytical approach to forecasting loan repayment outcomes, leveraging machine learning to enhance decision-making in the lending domain.

1.1. Background

The financial sector has witnessed a significant transformation, driven by advancements in data analytics and machine learning technologies. These developments have enabled lenders to refine their risk assessment models, thereby improving their loan portfolio's performance. Traditionally, the evaluation of loan applications relied heavily on manual underwriting processes, which, while effective, are time-consuming and often fail to capture the complexity of borrower profiles. The integration of predictive analytics into loan assessment processes offers a more nuanced understanding of potential risks and rewards, facilitating more informed lending decisions.

1.2. Problem Statement

Despite the wealth of data available to lenders, the challenge of accurately predicting loan repayment remains daunting. Inaccurate risk assessments can lead to higher default rates, adversely affecting the financial health of lending institutions and restricting access to credit for worthy borrowers. Moreover, the dynamic nature of the economy, characterized by fluctuating interest rates, changing regulatory landscapes, and evolving borrower behaviors, adds layers of complexity to loan risk assessment. There is a critical need for an analytical framework that can effectively predict loan risk and quality, leveraging the full spectrum of available data to make informed lending decisions.

1.3. Objective

The primary objective of this analysis is to develop and evaluate predictive models capable of accurately assessing loan risk and quality based on a comprehensive set of borrower and loan characteristics. This entails:

  • Utilizing exploratory data analysis to identify key factors influencing loan repayment outcomes.

  • Employing data processing and feature engineering techniques to prepare the data for predictive modeling.

  • Implementing and comparing multiple machine learning algorithms to determine the most effective model in predicting 'Good' loans.

  • Evaluating the models based on their predictive accuracy and their potential impact on loan portfolio profitability.

2. Methodology

The methodology adopted for this analysis encompasses a structured approach to understanding, preparing, and modeling loan data to predict loan repayment outcomes and assess their financial implications. The process is detailed across several subsections, including Exploratory Data Analysis, Data Processing, Feature Selection, Predictive Analysis, and Model Evaluation.

2.1. Data Preparation

Exploratory Data Analysis (EDA) initiates the analytical process, delineating the dataset's characteristics to establish a foundational understanding of its dimensions, structure, and variable diversity. EDA involves calculating descriptive statistics for both numerical and categorical variables, uncovering distributions, central tendencies, and variabilities. This step provides an empirical framework for pattern identification and data quality assessment.

Data visualization plays a pivotal role in EDA, leveraging histograms, box plots, scatter plots, and correlation heatmaps. These tools facilitate the examination of variable distributions, outlier detection, and the exploration of feature interactions, enhancing the interpretability of statistical findings. Graphical representations offer an intuitive grasp of the data's underlying structure and variable relationships, crucial for informed modeling.

Following EDA, the data preprocessing phase addresses data quality through cleaning operations aimed at rectifying issues like missing values, anomalies, and outliers. Categorical variables undergo encoding to numerical formats to ensure compatibility with machine learning algorithms. Additionally, feature engineering is employed to create new variables, enhancing model predictive capabilities based on EDA-derived insights and domain expertise.
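As a concrete illustration of these preprocessing steps, the sketch below imputes a missing numeric value with the median and one-hot encodes a categorical column. The miniature frame is a hypothetical stand-in for the loan data; only the column names `loanAmount` and `payFrequency` are taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the loan data
df = pd.DataFrame({
    "loanAmount": [500.0, 3000.0, None, 400.0],
    "payFrequency": ["B", "W", "B", "M"],
})

# Impute the missing numeric value with the column median
df["loanAmount"] = df["loanAmount"].fillna(df["loanAmount"].median())

# One-hot encode the categorical repayment frequency for model compatibility
encoded = pd.get_dummies(df, columns=["payFrequency"], prefix="freq")
print(encoded.columns.tolist())
```

Median imputation is used here because it is the strategy the analysis later applies to 'apr' and 'loanAmount'; other columns may warrant different treatments.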

2.2. Feature Engineering

The methodology for developing predictive models to forecast loan repayment begins with crucial data preparation steps, including a train-test split and feature scaling. Initially, the dataset is divided into training and testing sets, a critical process that ensures models are evaluated on unseen data, thereby testing their generalizability and robustness. Following the split, feature scaling is performed to standardize the range of data features using methods such as StandardScaler. This step is essential for models sensitive to the scale of input features, as it ensures no single variable disproportionately influences the model's predictions due to its scale, enabling a fair and balanced assessment of each feature's predictive power.
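The split-then-scale sequence described above can be sketched as follows, using synthetic arrays in place of the loan features. The 80/20 split ratio is an assumption for illustration, as the document does not state the exact proportion; note that the scaler is fitted on the training set only, so no information from the test set leaks into the transformation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # stand-in feature matrix
y = rng.integers(0, 2, size=200)     # stand-in 'GoodLoan' labels

# Hold out 20% as unseen test data; stratify to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # each feature centered near 0
```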

Subsequently, the focus shifts to feature selection, a process that identifies the most significant predictors of loan repayment. This involves two main approaches: univariate analysis and model-based selection. Univariate analysis, conducted via SelectKBest with the f_classif criterion, evaluates the statistical significance of individual features in relation to the target variable, GoodLoan, ranking them based on their ability to predict loan repayment outcomes independently. In parallel, model-based selection using a RandomForest Classifier offers a deeper insight by assessing feature importance within the context of a model, highlighting how each feature contributes to the overall predictive accuracy. The integration of these approaches results in a consolidated set of key features, meticulously chosen for their collective impact on accurately predicting loan repayment, thereby providing a solid foundation for the development of robust predictive models.
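A minimal sketch of the two selection approaches and their consolidation is shown below, run on a synthetic stand-in dataset. The feature names and the choice of `k=4` are illustrative assumptions; taking the union of the two rankings is one simple way to consolidate them, and the document does not specify the exact combination rule.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the prepared loan features and 'GoodLoan' target
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
cols = [f"f{i}" for i in range(8)]

# Univariate ranking: ANOVA F-statistic of each feature against the target
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
univariate = set(np.array(cols)[selector.get_support()])

# Model-based ranking: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
model_based = set(pd.Series(rf.feature_importances_, index=cols)
                    .nlargest(4).index)

# Consolidate the two views into one candidate feature set
selected = univariate | model_based
print(sorted(selected))
```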

2.3. Machine Learning Model

In the predictive analysis phase, a comprehensive suite of machine learning algorithms is employed to assess the probability of loan repayments. This selection includes Logistic Regression, Decision Tree, Random Forest, and XGBoost models, each chosen for its unique methodological strengths and suitability for classification challenges.

  • Logistic Regression is utilized for its capability to offer probabilistic insights into the impact of predictors on the classification of loans as 'Good', facilitated by its linear approach.

  • The Decision Tree algorithm provides an explicit representation of decision paths, segmenting the dataset into increasingly homogeneous subsets, thus offering an interpretable model structure.

  • Enhancement of Decision Trees through the Random Forest algorithm involves aggregating multiple trees to improve prediction accuracy and mitigate the risk of overfitting by averaging their outputs.

  • XGBoost stands out for its efficiency and effectiveness, leveraging a gradient boosting framework to sequentially build weak learners into a robust predictive model, optimizing both speed and accuracy.

Algorithms are trained on datasets emphasizing key predictive features, with their performance assessed through accuracy, precision, recall, and F1 score metrics. This evaluation, including analysis of confusion matrices, enables the selection of the most effective model for predicting 'Good Loans'. Implementing this model in a hypothetical business scenario highlights the impact of machine learning on improving decisions in loan management and risk assessment, showcasing the value of predictive analytics in strategic business operations.
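The train-and-compare loop just described can be condensed as follows, on synthetic data standing in for the selected loan features. XGBoost is omitted here so the sketch depends only on scikit-learn, and the dataset, split ratio, and hyperparameters are illustrative assumptions rather than the settings used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features and 'GoodLoan' target
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model, then report the four headline metrics plus the
# confusion matrix used to compare candidates
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          f"acc={accuracy_score(y_te, y_pred):.3f}",
          f"prec={precision_score(y_te, y_pred):.3f}",
          f"rec={recall_score(y_te, y_pred):.3f}",
          f"f1={f1_score(y_te, y_pred):.3f}")
    print(confusion_matrix(y_te, y_pred))
```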

3. Exploratory Data Analysis

During the exploratory data analysis phase, three critical datasets were utilized to gain insights into the factors influencing loan repayment and to build the foundation for the predictive models. These datasets include:

  • loan.csv

  • payment.csv

  • clarity_underwriting_variables.csv

These datasets, when combined, offer a multidimensional view of the loan lifecycle, from initial application through to funding and repayment. The EDA phase focused on cleansing these datasets, identifying key features, and understanding the relationships between various attributes and the target variable, 'GoodLoan'. This comprehensive analysis is instrumental in developing robust predictive models capable of accurately forecasting loan repayment outcomes.

3.1. loan.csv

This dataset serves as the primary source of loan-related information, encompassing various attributes of the loan application process, such as loan approval status, funding status, and other loan characteristics. It provides a comprehensive overview of each loan's lifecycle from application to funding, crucial for understanding the factors that contribute to a loan being classified as 'Good' or 'Bad'.

In [178]:
import pandas as pd

# Load the datasets
loan_df = pd.read_csv('/content/loan.csv', low_memory=False)

# Display the first few rows of each dataset to understand their structure
loan_df_head = loan_df.head()

# Display the shape of each dataset
loan_df_shape = loan_df.shape

loan_df_head, loan_df_shape
Out[178]:
(          loanId                          anon_ssn payFrequency    apr  \
 0  LL-I-07399092  beff4989be82aab4a5b47679216942fd            B  360.0   
 1  LL-I-06644937  464f5d9ae4fa09ece4048d949191865c            B  199.0   
 2  LL-I-10707532  3c174ae9e2505a5f9ddbff9843281845            B  590.0   
 3  LL-I-02272596  9be6f443bb97db7e95fa0c281d34da91            B  360.0   
 4  LL-I-09542882  63b5494f60b5c19c827c7b068443752c            B  590.0   
 
               applicationDate  originated              originatedDate  \
 0  2016-02-23T17:29:01.940000       False                         NaN   
 1  2016-01-19T22:07:36.778000        True  2016-01-20T15:49:18.846000   
 2  2016-08-01T13:51:14.709000       False                         NaN   
 3  2015-08-06T23:58:08.880000       False                         NaN   
 4  2016-06-05T22:31:34.304000       False                         NaN   
 
    nPaidOff  approved  isFunded             loanStatus  loanAmount  \
 0       0.0     False         0  Withdrawn Application       500.0   
 1       0.0      True         1          Paid Off Loan      3000.0   
 2       0.0     False         0  Withdrawn Application       400.0   
 3       0.0     False         0  Withdrawn Application       500.0   
 4       0.0     False         0               Rejected       350.0   
 
    originallyScheduledPaymentAmount state     leadType  leadCost fpStatus  \
 0                            978.27    IL  bvMandatory         6      NaN   
 1                           6395.19    CA    prescreen         0  Checked   
 2                           1199.45    MO  bvMandatory         3      NaN   
 3                           1074.05    IL  bvMandatory         3      NaN   
 4                            814.37    NV  bvMandatory         3      NaN   
 
              clarityFraudId  hasCF  
 0  5669ef78e4b0c9d3936440e6      1  
 1  569eb3a3e4b096699f685d64      1  
 2  579eab11e4b0d0502870ef2f      1  
 3  555b1e95e4b0f6f11b267c18      1  
 4  5754a91be4b0c6a2bf424772      1  ,
 (577682, 19))

3.1.1. Customer Retention Insights from Loan Application Data

An analysis of loan application data, specifically examining anonymized Social Security Numbers (anon_ssn), reveals significant insights into customer behavior and retention. The dataset comprises 577,682 loan applications, with 459,393 being unique entries. This distinction highlights a repeat application rate of 20.48%, indicating that over one-fifth of the applications are from returning customers. Additionally, the average number of applications per unique customer stands at 1.26.

In [179]:
# Calculate total count of anon_ssn entries
total_anon_ssn_count = loan_df['anon_ssn'].count()

# Calculate unique count of anon_ssn
unique_anon_ssn_count = loan_df['anon_ssn'].nunique()

print(f"Total count of anon_ssn entries: {total_anon_ssn_count}")
print(f"Unique count of anon_ssn: {unique_anon_ssn_count}")

# Additional business metrics to consider:
# 1. Repeat Application Rate: The percentage of applications that are from repeat customers
repeat_application_rate = ((total_anon_ssn_count - unique_anon_ssn_count) / total_anon_ssn_count) * 100

# 2. Average Applications per Customer: The average number of applications submitted per unique customer
average_applications_per_customer = total_anon_ssn_count / unique_anon_ssn_count

print(f"Repeat Application Rate: {repeat_application_rate:.2f}%")
print(f"Average Applications per Customer: {average_applications_per_customer:.2f}")
Total count of anon_ssn entries: 577682
Unique count of anon_ssn: 459393
Repeat Application Rate: 20.48%
Average Applications per Customer: 1.26

3.1.2. Repayment Frequency Trends in Loan Performance

An examination of the 'payFrequency' variable within the loan dataset indicates varying preferences for repayment schedules among borrowers. The most common frequency is biweekly (B), accounting for over 50% of the loans, followed by weekly (W) payments. Monthly (M), semi-monthly (S), and irregular (I) schedules are less prevalent. These trends may influence loan performance, as different repayment frequencies could correlate with borrowers' cash flow stability and ability to consistently meet payment obligations.

In [180]:
import matplotlib.pyplot as plt

# Analyzing the impact of payFrequency on loan performance
pay_frequency_counts = loan_df['payFrequency'].value_counts(normalize=True) * 100

# Plotting the distribution of payFrequency
plt.figure(figsize=(10, 6))
pay_frequency_counts.plot(kind='bar')
plt.title('Distribution of Repayment Frequency')
plt.xlabel('Repayment Frequency')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.show()

3.1.3. APR Distribution Analysis in Loan Data

The 'Annual Percentage Rate' (APR) values within the loan dataset were transformed into a numeric format and analyzed for distribution characteristics. To address any missing or non-numeric values in the 'apr' field, these entries were coerced into numeric format, with any resulting missing values subsequently filled using the median APR. This approach ensures a consistent dataset for analysis. The density plot of APR values reveals several peaks, suggesting a diverse range of interest rates across the loans. These APR distributions could be indicative of varied loan products, differing risk assessments, and borrowers' creditworthiness.

In [181]:
import seaborn as sns

# convert APR to numeric
loan_df['apr'] = pd.to_numeric(loan_df['apr'], errors='coerce')

# Fill missing values in 'apr' column with the median of the column
median_apr = loan_df['apr'].median()
loan_df['apr'].fillna(median_apr, inplace=True)

# Plotting the APR distribution
plt.figure(figsize=(10, 6))
sns.kdeplot(loan_df['apr'], bw_adjust=0.5)
plt.title('Distribution of APR Values')
plt.xlabel('Annual Percentage Rate (APR)')
plt.ylabel('Density')
plt.show()

3.1.4. Trend Analysis of Loan Applications Over Time

The temporal distribution of loan applications, as presented in the graph, shows the count of applications over time based on their application date. Application volumes fluctuate throughout the observed period, with notable peaks indicating surges in applications from late 2015 through 2017. These variations could correspond to seasonal trends, economic factors, or changes in marketing strategies.

In [182]:
# 'applicationDate' in datetime format
loan_df['applicationDate'] = pd.to_datetime(loan_df['applicationDate'])

# Count the number of applications by date
applications_by_date = loan_df['applicationDate'].dt.date.value_counts().sort_index()

# Create a line plot for applications by date
plt.figure(figsize=(14, 7))
sns.lineplot(data=applications_by_date)
plt.title('Count of Applications Over Time')
plt.xlabel('Application Date')
plt.ylabel('Number of Applications')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

3.1.5. Loan Origination Timeline Analysis

The graphs show the distribution of the time taken from loan application to origination, highlighting the efficiency of the loan processing workflow. Most loans are processed within a very short period, with a sharp decrease in frequency as the number of days increases. This swift turnaround in the majority of cases may indicate a streamlined and efficient origination process. The log scale on the second chart gives a clearer view of longer processing times, emphasizing the tail of the distribution, which represents a smaller yet significant number of loans with extended origination timelines.

In [183]:
# Convert originatedDate to datetime format
loan_df['originatedDate'] = pd.to_datetime(loan_df['originatedDate'])

# Calculate the difference in days between originatedDate and applicationDate
loan_df['days_to_originate'] = (loan_df['originatedDate'] - loan_df['applicationDate']).dt.total_seconds() / (24 * 60 * 60)

# Filter the DataFrame to include only rows where 'days_to_originate' is less than 40
filtered_loan_df = loan_df[loan_df['days_to_originate'] < 40]

# Set up the matplotlib figure
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Regular histogram
axes[0].hist(filtered_loan_df['days_to_originate'], bins=range(0, 41), edgecolor='black')
axes[0].set_title('Time Taken from Application to Origination (Less than 40 Days)')
axes[0].set_xlabel('Days to Originate')
axes[0].set_ylabel('Number of Loans')
axes[0].set_xlim(0, 40)


# Histogram with log scale
axes[1].hist(filtered_loan_df['days_to_originate'], bins=range(0, 41), edgecolor='black', log=True)
axes[1].set_title('Log Scale: Time Taken from Application to Origination (Less than 40 Days)')
axes[1].set_xlabel('Days to Originate')
axes[1].set_ylabel('Number of Loans (Log Scale)')
axes[1].set_xlim(0, 40)
plt.show()

3.1.6. Loan Application Outcomes Analysis

The pie charts depict the proportions of loans at different stages of the lending process. Specifically, they show the percentages of loans that have been originated, approved, and ultimately funded. A small fraction, around 7-8%, progresses through each stage, underscoring a relatively high filtration rate during the lending process. This consistency suggests that while the majority of applications do not result in originated loans, those that do have a strong likelihood of progressing to approval and funding.

In [184]:
# Convert boolean values to strings for pie chart labeling
loan_df['originated'] = loan_df['originated'].replace({True: 'Yes', False: 'No'})
loan_df['approved'] = loan_df['approved'].replace({True: 'Yes', False: 'No'})
loan_df['isFunded'] = loan_df['isFunded'].replace({1: 'Yes', 0: 'No'})

# calculate the counts for the pie chart
originated_counts = loan_df['originated'].value_counts()
approved_counts = loan_df['approved'].value_counts()
funded_counts = loan_df['isFunded'].value_counts()


# Create a function to print the simplified decomposition tree
def print_simplified_decomposition_tree(df):
    for originated in df['originated'].unique():
        for approved in df['approved'].unique():
            # Filter the dataframe for the current branch
            branch_df = df[(df['originated'] == originated) & (df['approved'] == approved)]

            # For each 'isFunded' status, print the count
            for isFunded in df['isFunded'].unique():
                funded_count = branch_df[branch_df['isFunded'] == isFunded].shape[0]
                print(f"Originated: {originated} -> Approved: {approved} -> Funded: {isFunded} (Count: {funded_count})")

print_simplified_decomposition_tree(loan_df)

# create the pie chart
fig, axs = plt.subplots(1, 3, figsize=(14, 7))

# Pie chart for originated loans
axs[0].pie(originated_counts, labels=originated_counts.index, autopct='%1.2f%%', startangle=90)
axs[0].set_title('Proportion of Originated Loans')

# Pie chart for approved loans
axs[1].pie(approved_counts, labels=approved_counts.index, autopct='%1.2f%%', startangle=90)
axs[1].set_title('Proportion of Approved Loans')

# Pie chart for funded loans
axs[2].pie(funded_counts, labels=funded_counts.index, autopct='%1.2f%%', startangle=90)
axs[2].set_title('Proportion of Funded Loans')

# Show the plot
plt.show()
Originated: No -> Approved: No -> Funded: No (Count: 531674)
Originated: No -> Approved: No -> Funded: Yes (Count: 0)
Originated: No -> Approved: Yes -> Funded: No (Count: 2)
Originated: No -> Approved: Yes -> Funded: Yes (Count: 0)
Originated: Yes -> Approved: No -> Funded: No (Count: 5972)
Originated: Yes -> Approved: No -> Funded: Yes (Count: 0)
Originated: Yes -> Approved: Yes -> Funded: No (Count: 1052)
Originated: Yes -> Approved: Yes -> Funded: Yes (Count: 38982)

3.1.7. Repayment History Analysis

Upon analyzing the loan repayment history in the loan dataset, null values in the 'nPaidOff' column were replaced with 0, indicating no previous loans paid off. Grouping the data by 'nPaidOff' (the number of loans paid off by each customer) reveals that the vast majority of borrowers, totaling 564,623, have not paid off any previous loans. A much smaller number of borrowers have a history of loan repayment, with the count progressively decreasing as the number of loans paid off increases.

The histogram visualizations, with and without a logarithmic scale, illustrate the distribution of the number of loans paid off by customers. The steep decline in frequency from 0 loans paid off to any higher number highlights that a large portion of the customer base are first-time borrowers or have not fully repaid a previous loan.

In [185]:
# Change null values in 'PaidOff' to 0
loan_df['nPaidOff'] = loan_df['nPaidOff'].fillna(0)

# Group by 'nPaidOff' and count the number of loans in each group
n_paid_off_grouped = loan_df.groupby('nPaidOff').size().reset_index(name='Count')

# Set up the matplotlib figure with two subplots side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot histogram without a log scale on the first subplot
sns.histplot(loan_df['nPaidOff'], bins=22, kde=False, ax=axes[0])
axes[0].set_title('Histogram of Paid Off Loans Count')
axes[0].set_xlabel('Number of Loans Paid Off')
axes[0].set_ylabel('Frequency')

# Plot histogram with a log scale on the y-axis on the second subplot
sns.histplot(loan_df['nPaidOff'], bins=22, kde=False, ax=axes[1])
axes[1].set_yscale('log')  # Apply logarithmic scale to the y-axis
axes[1].set_title('Histogram of Paid Off Loans Count (Log Scale)')
axes[1].set_xlabel('Number of Loans Paid Off')
axes[1].set_ylabel('Frequency (Log Scale)')

# Adjust the layout so that titles and labels do not overlap
plt.tight_layout()

# Show the plot
plt.show()

3.1.8. Financial Distribution Analysis by Loan Quality

In the loan dataset, the 'loanStatus' category was refined to correct inconsistencies, and loans were subsequently classified as 'Good' or 'Bad' in a new 'loanQuality' column. The resulting distribution is heavily imbalanced, with the vast majority of loans falling into a single category under the chosen criteria, an observation that is pivotal for risk stratification. Missing values in 'loanAmount' were addressed by imputing the median value, which preserves data integrity for further analysis.

Two visual analyses were conducted: a violin plot examining the distribution of loan amounts by loan status, and a separate analysis for scheduled payment amounts. These visuals indicate variability in financial figures across different loan statuses, reflecting a diversity in loan types and borrower agreements.

Stacked bar plots were then generated to display the average loan amount and scheduled payment amount, segmented by loan status and quality. These plots offer a nuanced understanding of the average financial obligations associated with each loan status, categorized by the loan's risk quality.

In [186]:
import numpy as np

# A dictionary for cleaning typos and consolidating statuses
# Key: original status (possibly with typo), Value: cleaned and consolidated status

loan_df['loanStatus'] = loan_df['loanStatus'].replace({
    'Customver Voided New Loan': 'Customer Voided New Loan',
    'Settlement Pending Paid Off': 'Settlement Paid Off',
    'Charged Off': 'Charged Off Paid Off'
})

# Define the criteria for labeling loans as 'Good'
good_loan = [
    'Paid off Loan',
    'New Loan',
    'Returned Item',
    'Pending Paid Off',
    'Internal Collection',
    'External Collection',
    'Charged Off Paid Off',
    'Charged Off',
    'Settled Bankruptcy',
    'Settlement Paid Off',
    'Settlement Pending Paid Off'
]

# Apply the criteria to label loans as 'Good' or 'Bad'
loan_df['loanQuality'] = loan_df['loanStatus'].apply(lambda x: np.nan if pd.isna(x) else ('Good' if x in good_loan else 'Bad'))

# Check the distribution of the newly created 'loanQuality' column
print(loan_df['loanQuality'].value_counts())

# Fill missing values in 'loanAmount' column with the median of the column
median_loan_amount = loan_df['loanAmount'].median()
loan_df['loanAmount'].fillna(median_loan_amount, inplace=True)

# Violin plot for loan amount by loan status
plt.figure(figsize=(10, 6))
sns.violinplot(x='loanStatus', y='loanAmount', data=loan_df)
plt.title('Distribution of Loan Amount by Loan Status')
plt.xticks(rotation=90)  # Rotate x labels if they overlap
plt.show()

# Violin plot for originally scheduled payment amount by loan status
plt.figure(figsize=(10, 6))
sns.violinplot(x='loanStatus', y='originallyScheduledPaymentAmount', data=loan_df)
plt.title('Distribution of Scheduled Payment Amount by Loan Status')
plt.xticks(rotation=90)
plt.show()
Bad     549731
Good     27560
Name: loanQuality, dtype: int64
In [187]:
# Calculate average 'loanAmount' and 'originallyScheduledPaymentAmount' by 'loanStatus' and 'loanQuality'
average_loan_amount = loan_df.groupby(['loanStatus', 'loanQuality'])['loanAmount'].mean().unstack()
average_scheduled_payment = loan_df.groupby(['loanStatus', 'loanQuality'])['originallyScheduledPaymentAmount'].mean().unstack()

# Create stacked bar plots
fig, ax = plt.subplots(2, 1, figsize=(14, 12))

average_loan_amount.plot(kind='bar', stacked=True, ax=ax[0])
ax[0].set_title('Average Loan Amount by Loan Status and Quality')
ax[0].set_xlabel('Loan Status')
ax[0].set_ylabel('Average Loan Amount')
ax[0].legend(title='Loan Quality')
ax[0].tick_params(axis='x', rotation=90)

average_scheduled_payment.plot(kind='bar', stacked=True, ax=ax[1])
ax[1].set_title('Average Scheduled Payment Amount by Loan Status and Quality')
ax[1].set_xlabel('Loan Status')
ax[1].set_ylabel('Average Scheduled Payment Amount')
ax[1].legend(title='Loan Quality')
ax[1].tick_params(axis='x', rotation=90)

plt.tight_layout()
plt.show()

3.1.9. Loan Amount and Payment Correlation by Loan Status

The bubble plot provides a high-level overview, mapping average loan amounts against scheduled payments, with the size and color of each bubble representing the frequency and type of loan status, respectively.

In [188]:
# Calculate the average 'loanAmount' and 'originallyScheduledPaymentAmount' by 'loanStatus'
average_amounts = loan_df.groupby('loanStatus').agg({
    'loanAmount': 'mean',
    'originallyScheduledPaymentAmount': 'mean'
}).reset_index()

# Calculate the count of loans for each 'loanStatus'
loan_counts = loan_df['loanStatus'].value_counts().reset_index()
loan_counts.columns = ['loanStatus', 'Count']

# Merge the average amounts DataFrame with the counts DataFrame
bubble_data = pd.merge(average_amounts, loan_counts, on='loanStatus')

# Create the bubble plot using seaborn
plt.figure(figsize=(10, 8))
bubble_plot = sns.scatterplot(data=bubble_data, x='loanAmount', y='originallyScheduledPaymentAmount',
                              size='Count', hue='loanStatus', sizes=(20, 2000), alpha=0.6)

# Adjust details for clarity
plt.title('Bubble Plot of Loan Amount vs. Scheduled Payment Amount by Loan Status')
plt.xlabel('Average Loan Amount')
plt.ylabel('Average Scheduled Payment Amount')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend(title='Loan Status', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.show()

The regression plots for each loan status offer a granular view of the relationship between loan amounts and scheduled payment amounts. Annotating each plot with an R-squared value allows for the quantification of the strength of this relationship. Notably, a majority of the loan statuses exhibit R-squared values exceeding 0.9, signifying an exceptionally strong correlation between the loan amounts and their corresponding payment schedules. This high level of predictability in the data reflects well-established patterns in lending practices, where loan amounts have a direct and strong influence on the payment structure. This pattern is consistent across various types of loans, indicating that, in most cases, as the loan amount increases, the scheduled payment amount follows suit in a predictable manner.

In [189]:
import statsmodels.api as sm

# Create a Facet Grid
g = sns.FacetGrid(loan_df, col='loanStatus', col_wrap=4, height=3)

# Define a custom function to plot with regplot and calculate R-squared
def custom_regplot(*args, **kwargs):
    # Extract data
    x = kwargs.pop('x')
    y = kwargs.pop('y')
    data = kwargs.pop('data')

    # Filter out NaNs for regression calculation
    mask = ~data[x].isna() & ~data[y].isna()
    data = data[mask]

    # Calculate R-squared
    X = sm.add_constant(data[x])
    model = sm.OLS(data[y], X).fit()
    rsquared = model.rsquared

    # Make regplot
    ax = sns.regplot(x=x, y=y, data=data, *args, **kwargs)

    # Annotate the plot with R-squared value
    ax.text(0.05, 0.95, f'R-squared: {rsquared:.2f}', transform=ax.transAxes,
            verticalalignment='top', fontsize=9, color='red')

# Map the custom function to the grid
g.map_dataframe(custom_regplot, x='loanAmount', y='originallyScheduledPaymentAmount', scatter_kws={'alpha':0.3})

# Add titles and adjust the axis
g.set_titles("{col_name}")
g.set_axis_labels("Loan Amount", "Scheduled Payment Amount")

# Show the plot
plt.show()

3.1.10. Financial Distribution & Analysis of Loans by States

An in-depth review of the geographical distribution of loans reveals a pattern of lending activity that is concentrated in specific states. Ohio tops the chart with a substantial lead of 90,497 loans, with Illinois and Texas not far behind. This distribution pattern may reflect various influencing factors such as the economic vibrancy of these states, the effectiveness of lending institutions' marketing strategies, and the impact of state-specific financial regulations.

In [190]:
import folium
import json

# Dictionary to map state abbreviations to full names
state_abbreviations = {
    'AL': 'Alabama', 'AK': 'Alaska', 'AZ': 'Arizona', 'AR': 'Arkansas', 'CA': 'California',
    'CO': 'Colorado', 'CT': 'Connecticut', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia',
    'HI': 'Hawaii', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'IA': 'Iowa',
    'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'ME': 'Maine', 'MD': 'Maryland',
    'MA': 'Massachusetts', 'MI': 'Michigan', 'MN': 'Minnesota', 'MS': 'Mississippi',
    'MO': 'Missouri', 'MT': 'Montana', 'NE': 'Nebraska', 'NV': 'Nevada', 'NH': 'New Hampshire',
    'NJ': 'New Jersey', 'NM': 'New Mexico', 'NY': 'New York', 'NC': 'North Carolina',
    'ND': 'North Dakota', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania',
    'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee',
    'TX': 'Texas', 'UT': 'Utah', 'VT': 'Vermont', 'VA': 'Virginia', 'WA': 'Washington',
    'WV': 'West Virginia', 'WI': 'Wisconsin', 'WY': 'Wyoming', 'OH-TEST': 'Ohio',
    'TX-TEST': 'Texas'
}

# Convert state abbreviations to full names
loan_df['state'] = loan_df['state'].str.upper().map(state_abbreviations).fillna(loan_df['state'])

# Count the number of loans by state
state_loan_counts = loan_df['state'].value_counts().reset_index()
state_loan_counts.columns = ['state', 'loanCount']

# Sort the DataFrame by 'loanCount' in descending order to get the states with the highest loan counts
state_loan_counts_sorted = state_loan_counts.sort_values(by='loanCount', ascending=False)

# Get the top 5 states with the highest loan counts
top_5_states = state_loan_counts_sorted.head(5)
bottom_5_states = state_loan_counts_sorted.tail(5)

print(top_5_states)
print(bottom_5_states)

# Convert the state_loan_counts DataFrame to a dictionary
state_counts_dict = state_loan_counts.set_index('state')['loanCount'].to_dict()

# Load the GeoJSON file
with open('/content/us-states.json') as f:
    geojson_data = json.load(f)

# Update the GeoJSON properties with the loan counts
for feature in geojson_data['features']:
    state = feature['properties']['name']
    feature['properties']['loanCount'] = state_counts_dict.get(state, 0)

# Initialize a map centered around the geographic center of the contiguous US
m = folium.Map(location=[37.0902, -95.7129], zoom_start=4)

# Add the choropleth layer
folium.Choropleth(
    geo_data=geojson_data,
    name='choropleth',
    data=state_loan_counts,
    columns=['state', 'loanCount'],
    key_on='feature.properties.name',  # This should match the property in the GeoJSON file
    fill_color='YlGn',
    fill_opacity=0.2,
    line_opacity=0.5,
    legend_name='Loan Count'
).add_to(m)

# Add tooltips to the map
folium.GeoJson(
    geojson_data,
    name='GeoJSON - Loan Data',
    tooltip=folium.GeoJsonTooltip(
        fields=['name', 'loanCount'],
        aliases=['State:', 'Loan Count:'],
        localize=True
    )
).add_to(m)

# Add layer control to toggle the choropleth
folium.LayerControl().add_to(m)

# Display the map
m.save('loan_counts_by_state.html')
m
       state  loanCount
0       Ohio      90497
1   Illinois      66710
2      Texas      49730
3   Missouri      49249
4  Wisconsin      40349
           state  loanCount
37        Hawaii        621
38        Alaska        290
39  North Dakota        207
40      New York          3
41      Maryland          3
Out[190]:
(interactive choropleth map of loan counts by state; saved to loan_counts_by_state.html)

When examining the financial aspects of the loans distributed across the United States, Ohio stands out not only as the state with the most loans but also as the one with the highest total loan amount, at nearly $42 million. Ohio also leads in originally scheduled payment amounts, indicating robust lending and repayment activity. Such financial metrics offer a clear indication of each state's lending market size and the potential revenue streams for financial institutions.

In [191]:
# Aggregate the loan amounts and scheduled payments by state
state_loan_amounts = loan_df.groupby('state')['loanAmount'].sum().reset_index()
state_scheduled_payments = loan_df.groupby('state')['originallyScheduledPaymentAmount'].sum().reset_index()

# Sort in descending order to get the states with the highest and lowest amounts
state_loan_amounts_sorted = state_loan_amounts.sort_values(by='loanAmount', ascending=False)
state_scheduled_payments_sorted = state_scheduled_payments.sort_values(by='originallyScheduledPaymentAmount', ascending=False)

# Get the top 5 states with the highest amounts
top_5_states_loan_amounts = state_loan_amounts_sorted.head(5).reset_index(drop=True)
top_5_states_scheduled_payments = state_scheduled_payments_sorted.head(5).reset_index(drop=True)

print(top_5_states_loan_amounts)
print(top_5_states_scheduled_payments)

# Convert the aggregated data to a dictionary
loan_amount_dict = state_loan_amounts.set_index('state').to_dict()['loanAmount']
scheduled_payments_dict = state_scheduled_payments.set_index('state').to_dict()['originallyScheduledPaymentAmount']

# Load the GeoJSON file
with open('/content/us-states.json') as f:
    geojson_data = json.load(f)

# Update the GeoJSON properties with the loan data
for feature in geojson_data['features']:
    state = feature['properties']['name']
    feature['properties']['loanAmount'] = loan_amount_dict.get(state, 0)
    feature['properties']['scheduledPaymentAmount'] = scheduled_payments_dict.get(state, 0)

# Initialize the map centered around the geographic center of the contiguous US
m = folium.Map(location=[37.0902, -95.7129], zoom_start=4)


# Add the choropleth layer for loan amount
folium.Choropleth(
    geo_data=geojson_data,
    name='Choropleth - Loan Amount',
    data=state_loan_amounts,
    columns=['state', 'loanAmount'],
    key_on='feature.properties.name',
    fill_color='BuPu',
    fill_opacity=0.2,
    line_opacity=0.5,
    legend_name='Total Loan Amount'
).add_to(m)

# Add the choropleth layer for originally scheduled payment amount
folium.Choropleth(
    geo_data=geojson_data,
    name='Choropleth - Scheduled Payments',
    data=state_scheduled_payments,
    columns=['state', 'originallyScheduledPaymentAmount'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.2,
    line_opacity=0.5,
    legend_name='Total Scheduled Payment Amount'
).add_to(m)

folium.GeoJson(
    geojson_data,
    name='US States',
    style_function=lambda feature: {
        'fillColor': 'blue',
        'color': 'black',
        'weight': 2,
        'dashArray': '5, 5'
    },
    tooltip=folium.GeoJsonTooltip(
        fields=['name', 'loanAmount', 'scheduledPaymentAmount'],
        aliases=['State', 'Total Loan Amount', 'Total Scheduled Payments'],
        localize=True
    )
).add_to(m)

# Add layer control to toggle the choropleth layers
folium.LayerControl().add_to(m)

# Save to an HTML file and display
m.save('loan_amounts_and_payments_by_state.html')
m
        state   loanAmount
0        Ohio  41981445.00
1    Illinois  32071445.25
2    Missouri  22891087.50
3       Texas  21580228.25
4  California  21018530.50
        state  originallyScheduledPaymentAmount
0        Ohio                      1.241210e+08
1       Texas                      6.918227e+07
2    Illinois                      6.709159e+07
3  California                      6.104817e+07
4    Missouri                      6.092960e+07
Out[191]:
(interactive choropleth map of loan amounts and scheduled payments by state; saved to loan_amounts_and_payments_by_state.html)

3.1.11. Lead Distribution and Cost Analysis by Type¶

The 'bvMandatory' type constitutes the overwhelming majority of leads, which may suggest either a significant focus on a particular marketing channel or a default categorization within the data. In contrast, other lead types such as 'organic', 'prescreen', and various named categories contribute a smaller fraction, hinting at a more diverse, yet considerably less represented, set of lead origins. The 'express', 'instant-offer', 'lionpay', and 'repeat' types occur least frequently, potentially indicating highly targeted lead generation strategies.

To support these observations, grouping 'leadCost' by 'leadType' provides further insight. While 'bvMandatory' dominates in volume, the generic 'lead' type actually carries the highest total spend ($2,256,517, narrowly ahead of 'bvMandatory' at $2,211,103). Meanwhile, 'california' leads bear the highest average cost, potentially reflecting a higher value or cost of acquisition in that region.

In [192]:
# Calculate the mean, median, and total of leadCost by leadType
lead_type_stats = loan_df.groupby('leadType')['leadCost'].agg(Average='mean', Median='median', Total='sum').reset_index()

print(lead_type_stats)

# Count Plot for LeadType
plt.figure(figsize=(12, 6))
sns.countplot(data=loan_df, x='leadType', order=loan_df['leadType'].value_counts().index)
plt.title('Count of Leads by Type')
plt.xticks(rotation=45)
plt.xlabel('Lead Type')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
        leadType     Average  Median    Total
0    bvMandatory    4.654944     3.0  2211103
1     california  129.167015   120.0    61871
2        express    0.000000     0.0        0
3  instant-offer    0.000000     0.0        0
4           lead   31.050280    25.0  2256517
5        lionpay    0.000000     0.0        0
6        organic    0.343442     0.0     7848
7      prescreen    0.000000     0.0        0
8   rc_returning    0.000000     0.0        0
9         repeat    0.000000     0.0        0

3.1.12. Proportions of First Payment Statuses¶

The bar chart presents the distribution of the initial payment statuses for loans, highlighting 'Checked' as the predominant status, accounting for over half of the entries. This status likely indicates that the first payment has been verified or processed. Notably less frequent yet significant is 'No Payments', which may suggest loans where no initial payment was made or recorded. Other statuses such as 'Rejected', 'No Schedule', 'Cancelled', 'Skipped', 'Pending', and 'Returned' represent smaller proportions, reflecting the various outcomes of initial payment processing.

In [193]:
# Analyzing the impact of the first payment status (fpStatus) on loan outcomes
fp_status_counts = loan_df['fpStatus'].value_counts(normalize=True) * 100

# Plotting the distribution of the first payment status (fpStatus)
plt.figure(figsize=(10, 6))
fp_status_counts.plot(kind='bar', color='green')
plt.title('Distribution of First Payment Status')
plt.xlabel('First Payment Status')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.show()

3.1.13. Repeat Incidence Rate in Clarity Fraud ID Analysis¶

The data reflects that out of 357,693 entries in the loan dataset, there are 314,915 unique instances of clarityFraudId, indicating that approximately 11.96% of the entries are repeated incidents. This rate suggests that a significant minority of the loan applications might be associated with multiple fraud assessments, hinting at potential risks or inefficiencies that could warrant further investigation by the financial institution.

In [194]:
# Calculate the total count of clarityFraudId entries, including duplicates
total_clarityFraudId_count = loan_df['clarityFraudId'].count()

# Calculate the unique count of clarityFraudId
unique_clarityFraudId_count = loan_df['clarityFraudId'].nunique()

# Calculate the percentage of repeat incidents
repeat_incidence_rate = (1 - (unique_clarityFraudId_count / total_clarityFraudId_count)) * 100

print(f"Total count of clarityFraudId entries: {total_clarityFraudId_count}")
print(f"Unique count of clarityFraudId: {unique_clarityFraudId_count}")
print(f"Repeated Incidence Rate: {round(repeat_incidence_rate,2)}%")
Total count of clarityFraudId entries: 357693
Unique count of clarityFraudId: 314915
Repeated Incidence Rate: 11.96%
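To see which identifiers drive that 11.96% rate, the duplicated IDs can be surfaced with `value_counts`. A minimal sketch on synthetic data (the ID values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature stand-in for loan_df['clarityFraudId'];
# the real column holds 357,693 entries with 314,915 unique values.
ids = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])

# IDs seen more than once are the "repeat incidents"
repeat_counts = ids.value_counts()
repeats = repeat_counts[repeat_counts > 1]

repeat_rate = (1 - ids.nunique() / ids.count()) * 100
print(repeats.to_dict())      # {'a': 3, 'b': 2}
print(round(repeat_rate, 2))  # 42.86
```

On the real column, sorting `repeats` descending would reveal the applicants with the most fraud assessments attached.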

3.2. payment.csv¶

This dataset tracks the repayment behavior of borrowers, including whether it is a regular payment or part of a custom collection plan, the payment sequence, date, amounts (principal, fees, total), payment status, and any ACH error codes explaining payment failures. Analyzing this data helps in identifying patterns related to loan repayment and defaults, offering predictive insights into the likelihood of future loan performance based on historical payment behavior.

In [195]:
# Load the datasets
payment_df = pd.read_csv('/content/payment.csv',low_memory=False)

# Display the first few rows of each dataset to understand their structure
payment_df_head = payment_df.head()

# Display the shape of each dataset
payment_df_shape = payment_df.shape

payment_df_head, payment_df_shape
Out[195]:
(          loanId  installmentIndex  isCollection          paymentDate  \
 0  LL-I-00000021                 1         False  2014-12-19T05:00:00   
 1  LL-I-00000021                 2         False  2015-01-02T05:00:00   
 2  LL-I-00000021                 3         False  2015-01-16T05:00:00   
 3  LL-I-00000021                 4         False  2015-01-30T05:00:00   
 4  LL-I-00000021                 5         False  2015-02-13T05:00:00   
 
    principal    fees  paymentAmount paymentStatus paymentReturnCode  
 0      22.33  147.28         169.61       Checked               NaN  
 1      26.44  143.17         169.61       Checked               NaN  
 2      31.30  138.31         169.61       Checked               NaN  
 3      37.07  132.54         169.61       Checked               NaN  
 4      43.89  125.72         169.61       Checked               NaN  ,
 (689364, 9))

3.2.1. Proportion of Payments in Collection Status¶

The pie chart illustrates the distribution of collection status within a set of payments, with a vast majority, 97.98%, not marked for collection (False), while only a small fraction, 2.02%, are in the collection process (True). This suggests that the bulk of payments are proceeding as expected without the need for recovery efforts.

In [196]:
# Calculate the count of True vs False in 'isCollection'
collection_counts = payment_df['isCollection'].value_counts()

# Create a pie chart
plt.figure(figsize=(8, 8))
collection_counts.plot.pie(autopct='%1.2f%%', startangle=90, counterclock=False)
plt.title('Distribution of Collection Status')
plt.ylabel('')  # Hide the y-label as it's not necessary for pie charts

# Show the pie chart
plt.show()

3.2.2. Average and Frequency of Installment Loan by Collection Status¶

The histogram portrays the frequency of installment counts for loans, segmented by collection status. Most loans have fewer than 15 installments, with a noticeable peak at 14 installments. There's a significant drop in frequency as the installment count increases, with only a small number reaching 30 or more installments. Notably, the number of loans that have entered collection is relatively small across all installment counts.

The average duration of installments across the loan data set stands at 185 days. This figure serves as a central measure indicating the typical period over which borrowers are scheduled to repay their loans.

In [197]:
# Count the number of installment indices for each loanId
installment_count = payment_df.groupby('loanId')['installmentIndex'].count().reset_index(name='installment_count')

# Merge this count back into the original dataframe
payment_df = payment_df.merge(installment_count, on='loanId', how='left')

# For counts of 30 and above, group into '30+'
payment_df['installment_count_group'] = payment_df['installment_count'].apply(lambda x: '30+' if x >= 30 else x)

# Now create the frequency table for plotting
frequency_table = payment_df.pivot_table(index='installment_count_group', columns='isCollection', aggfunc='size', fill_value=0)

# Plot the stacked bar chart
ax = frequency_table.plot(kind='bar', stacked=True, figsize=(14, 7), width=0.8)
ax.set_title('Frequency of Installment Index Count per Loan ID by Collection Status')
ax.set_xlabel('Installment Index Count')
ax.set_ylabel('Frequency')
ax.set_xticklabels(frequency_table.index, rotation=0)

plt.show()
In [198]:
# Convert the paymentDate column to datetime
payment_df['paymentDate'] = pd.to_datetime(payment_df['paymentDate'])

# Calculate the difference between the first and last payment dates for each loanId
payment_duration = payment_df.groupby('loanId')['paymentDate'].agg(['min', 'max'])
payment_duration['installmentDuration'] = (payment_duration['max'] - payment_duration['min']).dt.days

# Merge the calculated duration back to the original dataframe
payment_df = payment_df.merge(payment_duration['installmentDuration'], on='loanId', how='left')

mean_duration = round(payment_df['installmentDuration'].mean(), 0)

print(f"Average Duration: {mean_duration}")
Average Duration: 185.0
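The span calculation above (first-to-last payment date per loan, in days) can be verified on a tiny made-up example; the dates below are purely illustrative:

```python
import pandas as pd

# Two payments for loan A six months apart, one payment for loan B
df = pd.DataFrame({
    'loanId': ['A', 'A', 'B'],
    'paymentDate': pd.to_datetime(['2015-01-01', '2015-07-01', '2015-03-10']),
})

# Same pattern as above: min/max per loan, then the day difference
span = df.groupby('loanId')['paymentDate'].agg(['min', 'max'])
span['installmentDuration'] = (span['max'] - span['min']).dt.days

print(span['installmentDuration'].to_dict())  # {'A': 181, 'B': 0}
```

Note that a loan with a single recorded payment gets a duration of zero, which slightly deflates the 185-day average.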

3.2.3. Distribution of Total Payments Per Loan ID¶

The histograms present a distribution of the total payment amounts aggregated per loan ID. The left chart shows the distribution on a standard scale, while the right chart employs a logarithmic scale to better visualize the distribution across a wide range of values. This analysis offers insights into the overall repayment behavior of borrowers.

In [199]:
# Sum the paymentAmount for each loanId
summed_payments = payment_df.groupby('loanId')['paymentAmount'].sum().reset_index(name='total_payment_amount')

fig, axs = plt.subplots(1, 2, figsize=(18, 7))

# Normal scale histogram
axs[0].hist(summed_payments['total_payment_amount'], bins=30, edgecolor='black')
axs[0].set_title('Histogram of Total Payment Amount per Loan ID')
axs[0].set_xlabel('Total Payment Amount')
axs[0].set_ylabel('Frequency of Loan IDs')

# Log scale histogram
axs[1].hist(summed_payments['total_payment_amount'], bins=30, edgecolor='black', log=True)
axs[1].set_title('Log-Scale Histogram of Total Payment Amount per Loan ID')
axs[1].set_xlabel('Total Payment Amount')
axs[1].set_ylabel('Log-Frequency of Loan IDs')

plt.tight_layout()
plt.show()

3.2.4. Distribution of Good vs Bad Payment¶

The pie chart shows the split of payments classified as 'Good' versus 'Bad': good payments account for 55.55% and bad payments for 44.45%. The stacked bar chart then breaks each quality class down by payment status, with sections representing statuses such as 'Cancelled', 'Checked', and 'Complete', making it clear which statuses contribute to each class.

In [200]:
# Define the good loan statuses
good_loan_statuses = ['Checked', 'None', 'Complete', 'Pending']

# Classify paymentStatus into goodLoan and badLoan
payment_df['payQuality'] = payment_df['paymentStatus'].apply(
    lambda x: 'Good' if x in good_loan_statuses else 'Bad'
)

# Count the occurrences of each pay quality type
pay_quality_counts = payment_df['payQuality'].value_counts()

# Detach the 'Bad' slice for emphasis
explode = (0, 0.1) if 'Bad' in pay_quality_counts else (0,)

# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(pay_quality_counts, labels=pay_quality_counts.index, autopct='%1.2f%%', startangle=90, explode=explode)
plt.title('Distribution of Good vs Bad Payment')
plt.axis('equal')
plt.show()

# Create a crosstabulation or frequency table
loan_quality_status = pd.crosstab(payment_df['payQuality'], payment_df['paymentStatus'])

# Plot a stacked bar chart
loan_quality_status.plot(kind='bar', stacked=True, figsize=(10, 7))

# Add labels and title
plt.xlabel('Pay Quality')
plt.ylabel('Count')
plt.title('Pay Quality Stacked with Payment Status')
plt.legend(title='Payment Status',bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)

# Show plot
plt.show()

3.2.5. Frequency of ACH Return Codes R01 to R33.¶

This graph shows the frequency of Automated Clearing House (ACH) return codes, the standardized codes that indicate why an ACH transaction was returned or failed. Each bar corresponds to a specific return code from R01 to R33, with its height indicating how often that code occurs in the dataset. Notably, R01, which typically signifies "Insufficient Funds", has by far the highest frequency, suggesting it is the most common reason for ACH transaction failures in this dataset; the other return codes appear at much lower frequencies.

In [201]:
# Create a list of desired return codes
valid_codes = [f'R{str(i).zfill(2)}' for i in range(1, 34)]

# Replace return codes not in the valid list with NaN
payment_df['ACHReturnCode'] = payment_df['paymentReturnCode'].apply(lambda x: x if x in valid_codes else np.nan)

# Count the occurrences of each return code
return_code_counts = payment_df['ACHReturnCode'].value_counts(dropna=True).sort_index()

# Plot the results
plt.figure(figsize=(14, 7))
return_code_counts.plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Frequency of ACH Return Codes R01 to R33')
plt.xlabel('ACH Return Code')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()  # Adjust layout to make room for x-axis labels
plt.show()

3.3. clarity_underwriting_variables.csv¶

This dataset contains variables derived from Clarity, a credit reporting agency, which are used in the underwriting process to assess the creditworthiness of applicants. The variables include credit scores, fraud indicators, and other risk-related factors that can significantly impact the decision to approve or decline a loan application.

In [202]:
# Load the datasets
clarity_df = pd.read_csv('/content/clarity_underwriting_variables.csv',low_memory=False)

# Create a dictionary for renaming columns
renamed_columns = {col: col.split('.')[-1] if '.' in col else col for col in clarity_df.columns}

# Rename the columns
clarity_df.rename(columns=renamed_columns, inplace=True)

# Display the first few rows of each dataset to understand their structure
clarity_df_head = clarity_df.head()

# Display the shape of each dataset
clarity_df_shape = clarity_df.shape

clarity_df_head, clarity_df_shape
Out[202]:
(   thirtydaysago  twentyfourhoursago  oneminuteago  onehourago  ninetydaysago  \
 0            8.0                 2.0           2.0         2.0            8.0   
 1            5.0                 2.0           2.0         2.0           11.0   
 2            9.0                 4.0           2.0         3.0           10.0   
 3            3.0                 2.0           2.0         2.0            9.0   
 4            5.0                 5.0           2.0         2.0            6.0   
 
    sevendaysago  tenminutesago  fifteendaysago  threesixtyfivedaysago  \
 0           2.0            2.0             5.0                   10.0   
 1           2.0            2.0             4.0                   21.0   
 2           8.0            2.0             9.0                   25.0   
 3           2.0            2.0             2.0                    9.0   
 4           5.0            2.0             5.0                    6.0   
 
   inquiryonfilecurrentaddressconflict  ...  phonematchtypedescription  \
 0                               False  ...           (M) Mobile Phone   
 1                                True  ...           (M) Mobile Phone   
 2                               False  ...           (M) Mobile Phone   
 3                               False  ...           (M) Mobile Phone   
 4                               False  ...           (M) Mobile Phone   
 
   overallmatchresult phonetype ssndobreasoncode ssnnamereasoncode  \
 0            partial       NaN              NaN               NaN   
 1            partial       NaN              NaN               NaN   
 2              match       NaN              NaN               NaN   
 3            partial       NaN              NaN               NaN   
 4              match       NaN              NaN               NaN   
 
   nameaddressreasoncode ssndobmatch overallmatchreasoncode clearfraudscore  \
 0                    A8       match                    6.0           871.0   
 1                   NaN       match                   11.0           397.0   
 2                   NaN       match                    1.0           572.0   
 3                   NaN       match                   11.0           838.0   
 4                   NaN       match                    1.0           768.0   
 
              underwritingid  
 0  54cbffcee4b0ba763e43144d  
 1  54cc0408e4b0418d9a7f78af  
 2  54cc0683e4b0418d9a80adb6  
 3  54cc0780e4b0ba763e43b74a  
 4  54cc1d67e4b0ba763e445b45  
 
 [5 rows x 54 columns],
 (49752, 54))
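The rename rule applied when loading this dataset keeps only the text after the final dot in each column name. A quick sketch of that rule (the dotted sample names below are illustrative, not the actual Clarity column names); note that if two source columns ever shared a last segment they would collide, so a uniqueness check is cheap insurance:

```python
# Keep only the segment after the final dot, as in the notebook's rename
cols = ['vendor.oneminuteago', 'vendor.score.clearfraudscore', 'underwritingid']
renamed = {c: c.split('.')[-1] if '.' in c else c for c in cols}

print(renamed['vendor.score.clearfraudscore'])  # clearfraudscore
print(renamed['underwritingid'])                # underwritingid

# Guard against two source columns mapping to the same short name
assert len(set(renamed.values())) == len(cols)
```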

3.3.1. Correlation Matrix of Fraud Inquiry Related Variables¶

The heatmap shows the relationships between fraud inquiry frequencies over different look-back periods. The most pronounced positive correlations appear between inquiries within fifteen days and those within thirty days, indicative of patterns in the timing of fraud inquiries that could be pivotal for detecting and preventing fraudulent activity. The matrix highlights the timeframes with the highest correlation, potentially indicating periods of increased risk for fraudulent behavior.

In [203]:
fraud_inquiry_columns = [
    'oneminuteago', 'tenminutesago', 'onehourago',
    'twentyfourhoursago', 'sevendaysago', 'fifteendaysago',
    'thirtydaysago', 'ninetydaysago','threesixtyfivedaysago'
]

# Select only the fraud inquiry related columns
fraud_inquiry_df = clarity_df[fraud_inquiry_columns]

# Compute the correlation matrix
correlation_matrix = fraud_inquiry_df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlation_matrix, mask=mask, cmap='coolwarm', vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

# Add titles and labels for clarity
plt.title('Correlation Matrix of Fraud Inquiry Related Variables')
plt.show()

3.3.2. Correlation Matrix of Fraud Indicator Variables¶

The matrix is organized to show correlation coefficients between fraud indicator variables, such as the consistency of telephone numbers with addresses and the frequency of inquiry on a file relative to its age. Notably, there is a particularly strong positive correlation, indicated by dark red, between variables such as 'highprobabilityssnbelongstoanother' and 'maxnumberofssnswithanybankaccount', which may suggest a pattern of behavior or data points that are significantly linked to potential fraudulent activity.

In [204]:
# List of fraud indicator columns that need conversion from True/False to 1/0
fraud_indicators_columns = [
    'inquiryonfilecurrentaddressconflict',
    'totalnumberoffraudindicators',
    'telephonenumberinconsistentwithaddress',
    'inquiryageyoungerthanssnissuedate',
    'onfileaddresscautious',
    'inquiryaddressnonresidential',
    'onfileaddresshighrisk',
    'ssnreportedmorefrequentlyforanother',
    'currentaddressreportedbytradeopenlt90days',
    'inputssninvalid',
    'inputssnissuedatecannotbeverified',
    'inquiryaddresscautious',
    'morethan3inquiriesinthelast30days',
    'onfileaddressnonresidential',
    'creditestablishedpriortossnissuedate',
    'driverlicenseformatinvalid',
    'inputssnrecordedasdeceased',
    'inquiryaddresshighrisk',
    'inquirycurrentaddressnotonfile',
    'bestonfilessnissuedatecannotbeverified',
    'highprobabilityssnbelongstoanother',
    'maxnumberofssnswithanybankaccount',
    'bestonfilessnrecordedasdeceased',
    'currentaddressreportedbynewtradeonly',
    'creditestablishedbeforeage18',
    'telephonenumberinconsistentwithstate',
    'driverlicenseinconsistentwithonfile',
    'workphonepreviouslylistedascellphone',
    'workphonepreviouslylistedashomephone'
]

# Ensure the column exists before attempting to modify it
for column in fraud_indicators_columns:
    if column in clarity_df.columns:
        # Convert True/False to 1/0
        clarity_df[column] = clarity_df[column].apply(lambda x: 1 if x == True else (0 if x == False else x))

# Display the first few rows to verify the conversion
clarity_df.head()
Out[204]:
thirtydaysago twentyfourhoursago oneminuteago onehourago ninetydaysago sevendaysago tenminutesago fifteendaysago threesixtyfivedaysago inquiryonfilecurrentaddressconflict ... phonematchtypedescription overallmatchresult phonetype ssndobreasoncode ssnnamereasoncode nameaddressreasoncode ssndobmatch overallmatchreasoncode clearfraudscore underwritingid
0 8.0 2.0 2.0 2.0 8.0 2.0 2.0 5.0 10.0 0.0 ... (M) Mobile Phone partial NaN NaN NaN A8 match 6.0 871.0 54cbffcee4b0ba763e43144d
1 5.0 2.0 2.0 2.0 11.0 2.0 2.0 4.0 21.0 1.0 ... (M) Mobile Phone partial NaN NaN NaN NaN match 11.0 397.0 54cc0408e4b0418d9a7f78af
2 9.0 4.0 2.0 3.0 10.0 8.0 2.0 9.0 25.0 0.0 ... (M) Mobile Phone match NaN NaN NaN NaN match 1.0 572.0 54cc0683e4b0418d9a80adb6
3 3.0 2.0 2.0 2.0 9.0 2.0 2.0 2.0 9.0 0.0 ... (M) Mobile Phone partial NaN NaN NaN NaN match 11.0 838.0 54cc0780e4b0ba763e43b74a
4 5.0 5.0 2.0 2.0 6.0 5.0 2.0 5.0 6.0 0.0 ... (M) Mobile Phone match NaN NaN NaN NaN match 1.0 768.0 54cc1d67e4b0ba763e445b45

5 rows × 54 columns

In [205]:
# Filter the DataFrame for the fraud indicators columns only
fraud_ind_df = clarity_df[fraud_indicators_columns]

# Compute the correlation matrix
correlation_matrix = fraud_ind_df.corr()

# Generate a mask for the upper triangle (optional, for visualization purposes)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(14, 12))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlation_matrix, mask=mask, cmap='coolwarm', vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt=".2f")

# Add titles and labels for clarity
plt.title('Correlation Matrix of Fraud Indicator Variables')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()  # Adjust layout to make room for the x-axis labels

# Show the plot
plt.show()

3.3.3. Distribution of Clear Fraud Scores by Match Reason Code and Phone Type¶

The scatter plot suggests that there is no clear, discernible relationship between the Clear Fraud Score and the Overall Match Reason Codes across different phone match types. The points are widely dispersed and do not form distinct patterns or clusters that would indicate a strong correlation. Each type of phone match, whether it's a mobile phone or a pager, appears to have a similar spread of fraud scores across the range of match reason codes. This lack of a visible trend suggests that the Clear Fraud Score is not strongly dependent on the match reason codes or the phone match type.

In [206]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=clarity_df, x='overallmatchreasoncode', y='clearfraudscore', hue='phonematchtypedescription')
plt.title('Distribution of Clear Fraud Scores by Match Reason Code and Phone Type')
plt.xlabel('Overall Match Reason Code')
plt.ylabel('Clear Fraud Score')
plt.legend(title='Phone Match Type Description', bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0)
plt.show()

3.4. Data Preprocessing¶

The data preprocessing section prepares the datasets for subsequent analytical tasks, ensuring data cleanliness, consistency, and suitability for generating reliable insights. By systematically converting, encoding, engineering, and cleaning the data, the dataset is primed for in-depth exploration and robust predictive modeling, and aligned with the methodological requirements of the modeling techniques used in the study's later phases. The process is divided into four detailed subsections, each addressing specific preprocessing needs.

3.4.1. Categorical Strings to Numerical Conversion¶

This subsection elaborates on the initial data preparation steps applied to the loan_df. Specifically, it involves converting categorical string values into more analytically suitable formats—boolean and integer. The conversion targets three key variables: 'originated', 'approved', and 'isFunded'. These variables, initially stored as strings ('Yes', 'No'), are mapped to binary representations (1 for 'Yes', 0 for 'No'). Additionally, to further streamline the dataset for predictive modeling, the loanQuality attribute, which categorizes loans based on their risk assessment, is transformed into a binary variable named GoodLoan. Loans classified as 'Good' are labeled with a 1, while all others are marked as 0.

This transformation is crucial for two reasons: first, it aligns the data with the requirements of various statistical models that necessitate numerical input; second, it simplifies the dataset, enhancing clarity and reducing complexity for future analyses.

In [207]:
# Convert string values to boolean/integer
loan_df['originated'] = loan_df['originated'].replace({'Yes': 1, 'No': 0})
loan_df['approved'] = loan_df['approved'].replace({'Yes': 1, 'No': 0})
loan_df['isFunded'] = loan_df['isFunded'].replace({'Yes': 1, 'No': 0})
loan_df['GoodLoan'] = loan_df['loanQuality'].apply(lambda x: 1 if x == 'Good' else 0)

# Converting categorical variables to dummy variables
categorical_columns = ['payFrequency', 'leadType', 'fpStatus']
loan_df_dummies = pd.get_dummies(loan_df[categorical_columns], drop_first=True)

# Joining the dummy variables with the original dataframe
loan_df = loan_df.drop(columns=categorical_columns).join(loan_df_dummies)

# Display the new dataframe structure
loan_df.head()
Out[207]:
loanId anon_ssn apr applicationDate originated originatedDate nPaidOff approved isFunded loanStatus ... leadType_prescreen leadType_rc_returning leadType_repeat fpStatus_Checked fpStatus_No Payments fpStatus_No Schedule fpStatus_Pending fpStatus_Rejected fpStatus_Returned fpStatus_Skipped
0 LL-I-07399092 beff4989be82aab4a5b47679216942fd 360.0 2016-02-23 17:29:01.940 0 NaT 0.0 0 0 Withdrawn Application ... 0 0 0 0 0 0 0 0 0 0
1 LL-I-06644937 464f5d9ae4fa09ece4048d949191865c 199.0 2016-01-19 22:07:36.778 1 2016-01-20 15:49:18.846 0.0 1 1 Paid Off Loan ... 1 0 0 1 0 0 0 0 0 0
2 LL-I-10707532 3c174ae9e2505a5f9ddbff9843281845 590.0 2016-08-01 13:51:14.709 0 NaT 0.0 0 0 Withdrawn Application ... 0 0 0 0 0 0 0 0 0 0
3 LL-I-02272596 9be6f443bb97db7e95fa0c281d34da91 360.0 2015-08-06 23:58:08.880 0 NaT 0.0 0 0 Withdrawn Application ... 0 0 0 0 0 0 0 0 0 0
4 LL-I-09542882 63b5494f60b5c19c827c7b068443752c 590.0 2016-06-05 22:31:34.304 0 NaT 0.0 0 0 Rejected ... 0 0 0 0 0 0 0 0 0 0

5 rows × 39 columns

3.4.2. Categorical Variables Encoding and Dimensionality Reduction¶

In this step, the focus shifts to the payment_df, where significant transformations are undertaken to align with the analytical framework. The approach includes encoding the 'isCollection' variable to represent collection status as binary indicators ('collection_false'), and similarly transforming 'payQuality' into binary indicators ('good_pay') based on the condition of the payment status. This binary encoding simplifies the representation of categorical status, facilitating its direct application in predictive modeling.

Following the creation of these binary indicators, a selection of columns deemed less relevant or redundant for the upcoming analysis—such as 'installmentIndex', 'isCollection', 'paymentDate', among others—are removed from the dataset. This step streamlines the dataset, focusing on variables with greater predictive value or analytical interest.

A critical part of this subsection involves grouping the data by 'loanId'. This aggregation is pivotal, as it consolidates payment information at the loan level, allowing for a comprehensive view of each loan's payment history and status. The aggregation functions applied—such as taking the last value for 'installment_count' and 'installmentDuration', and summing the binary indicators—serve to capture the essence of the loan's payment performance and collection activities in a condensed form.

By grouping and summarizing payment data, the dataset is enriched with aggregated metrics that reflect the payment behavior and quality of loans on an individual loan basis. These transformations are instrumental in preparing the payment_df for integration with other datasets, ensuring that subsequent analyses can leverage a holistic view of each loan's characteristics and history.

In [208]:
# Add new columns based on the conditions provided
payment_df['collection_false'] = payment_df['isCollection'].apply(lambda x: 0 if x else 1)
payment_df['good_pay'] = payment_df['payQuality'].apply(lambda x: 1 if x == 'Good' else 0)

columns_to_remove = [
    'installmentIndex', 'isCollection', 'paymentDate', 'principal',
    'fees', 'paymentAmount', 'paymentStatus', 'paymentReturnCode',
    'installment_count_group', 'payQuality', 'ACHReturnCode'
]

# Drop the columns that are no longer needed
payment_df = payment_df.drop(columns=columns_to_remove)

# Group by 'loanId' and calculate the last value, sum, and counts
payment_df = payment_df.groupby('loanId').agg({
    'installment_count': 'last',
    'installmentDuration': 'last',
    'collection_false': 'sum',
    'good_pay': 'sum'
}).reset_index()

# Display the DataFrame to confirm the columns are removed
payment_df.head()
Out[208]:
loanId installment_count installmentDuration collection_false good_pay
0 LL-I-00000021 13 496 12 12
1 LL-I-00000231 20 154 20 10
2 LL-I-00000238 16 153 16 11
3 LL-I-00000308 22 156 22 11
4 LL-I-00000355 23 153 23 11

3.4.3. Feature Engineering in Clarity Underwriting Variables¶

In the clarity underwriting dataset (clarity_df), significant feature engineering efforts are undertaken to extract and highlight valuable information from existing variables. This process includes creating binary variables from categorical data, such as phone types and specific codes related to SSN and DOB discrepancies. Additionally, match results and other categorical assessments are numerically encoded, transforming qualitative judgments into quantifiable metrics.

These efforts are not merely technical transformations but are designed to distill and amplify the predictive signals contained within the underwriting data. By converting nuanced categorical information into binary indicators, the dataset becomes more amenable to statistical analysis, enabling more precise modeling of loan outcomes based on underwriting variables. This subsection underscores the methodological rigor applied to prepare the underwriting dataset for integration and analysis, highlighting the thoughtful consideration given to preserving and enhancing the data's predictive value.

In [209]:
# Converting categorical variables to dummy variables
categorical_columns1 = ['phonetype','ssndobreasoncode', 'ssnnamereasoncode','phonematchtype','nameaddressreasoncode']
clarity_df_dummies = pd.get_dummies(clarity_df[categorical_columns1], drop_first=True)

# Joining the dummy variables with the original dataframe
clarity_df = clarity_df.drop(columns=categorical_columns1).join(clarity_df_dummies)

# List of columns to apply the conversion
columns_to_convert = [
    'ssnnamematch',
    'nameaddressmatch',
    'phonematchresult',
    'overallmatchresult',
    'ssndobmatch'
]

def match_to_numeric(value):
    return 1 if value == 'match' else 0

# Apply the conversion to each specified column
for col in columns_to_convert:
    if col in clarity_df.columns:
        clarity_df[col] = clarity_df[col].apply(match_to_numeric)

# Define the columns to be deleted
columns_to_delete = [
    'ssnnamereasoncodedescription',
    'nameaddressreasoncodedescription',
    'phonematchtypedescription',
    'inputssnrecordedasdeceased',
    'bestonfilessnrecordedasdeceased',
    'bestonfilessnissuedatecannotbeverified',
    'inputssninvalid']

# Delete the columns from the DataFrame
clarity_df.drop(columns=columns_to_delete, inplace=True)

# Display the modified DataFrame
clarity_df.head()
Out[209]:
thirtydaysago twentyfourhoursago oneminuteago onehourago ninetydaysago sevendaysago tenminutesago fifteendaysago threesixtyfivedaysago inquiryonfilecurrentaddressconflict ... ssndobreasoncode_D06 ssndobreasoncode_D07 ssnnamereasoncode_S07 phonematchtype_F phonematchtype_FA phonematchtype_L phonematchtype_LA phonematchtype_M phonematchtype_P phonematchtype_U
0 8.0 2.0 2.0 2.0 8.0 2.0 2.0 5.0 10.0 0.0 ... 0 0 0 0 0 0 0 1 0 0
1 5.0 2.0 2.0 2.0 11.0 2.0 2.0 4.0 21.0 1.0 ... 0 0 0 0 0 0 0 1 0 0
2 9.0 4.0 2.0 3.0 10.0 8.0 2.0 9.0 25.0 0.0 ... 0 0 0 0 0 0 0 1 0 0
3 3.0 2.0 2.0 2.0 9.0 2.0 2.0 2.0 9.0 0.0 ... 0 0 0 0 0 0 0 1 0 0
4 5.0 5.0 2.0 2.0 6.0 5.0 2.0 5.0 6.0 0.0 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 57 columns

3.4.4. Merging Datasets and Finalizing Data Cleaning¶

The final subsection details the integration of the previously prepared loan_df and payment_df with the clarity_df, based on key identifiers ('loanId' and 'clarityFraudId'). This merging process creates a unified dataset that encapsulates the loan application process, payment history, and underwriting insights in a single analytical framework.

Post-merging, additional data cleaning steps are undertaken, including the removal of columns that are redundant, irrelevant, or could introduce bias into subsequent analyses. This careful pruning of the dataset ensures that the remaining variables are directly relevant to the study's objectives.

Moreover, the subsection describes the handling of missing values, a critical aspect of data preparation. Numerical columns are imputed with the median of their respective distributions, while categorical columns are filled with the mode. These imputation strategies are chosen to preserve the underlying distributional characteristics of the data while ensuring completeness.

In [210]:
# Join loan_df and payment_df on 'loanId'
combined_df = pd.merge(loan_df, payment_df, on='loanId', how='left')

# The clarity_underwriting_variables.csv is expected to be joined on 'clarityFraudId'
combined_df = pd.merge(combined_df, clarity_df, left_on='clarityFraudId', right_on='underwritingid', how='left')

# Display the first few rows of the combined dataframe to verify the join
combined_df_head = combined_df.head()

columns_to_remove = [
    'loanId',
    'anon_ssn',
    'originatedDate',
    'applicationDate',
    'loanStatus',
    'loanQuality',
    'state',
    'underwritingid',
    'clarityFraudId'
]

# Remove the columns listed above
combined_df.drop(columns=columns_to_remove, inplace=True)

# Fill NA values in numerical columns with the median
numerical_columns = combined_df.select_dtypes(include=['number']).columns
combined_df[numerical_columns] = combined_df[numerical_columns].fillna(combined_df[numerical_columns].median())

# Automatically identify categorical columns and fill NA values with the mode
categorical_columns = combined_df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    combined_df[col] = combined_df[col].fillna(combined_df[col].mode()[0])

# Check if there are any remaining missing values
print(combined_df.isnull().sum())

combined_df.head()
# combined_df.to_csv("finalDataset_LWJ.csv", index=False)
apr                  0
originated           0
nPaidOff             0
approved             0
isFunded             0
                    ..
phonematchtype_L     0
phonematchtype_LA    0
phonematchtype_M     0
phonematchtype_P     0
phonematchtype_U     0
Length: 91, dtype: int64
Out[210]:
apr originated nPaidOff approved isFunded loanAmount originallyScheduledPaymentAmount leadCost hasCF days_to_originate ... ssndobreasoncode_D06 ssndobreasoncode_D07 ssnnamereasoncode_S07 phonematchtype_F phonematchtype_FA phonematchtype_L phonematchtype_LA phonematchtype_M phonematchtype_P phonematchtype_U
0 360.0 0 0.0 0 0 500.0 978.27 6 1 0.097036 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 199.0 1 0.0 1 1 3000.0 6395.19 0 1 0.737292 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 590.0 0 0.0 0 0 400.0 1199.45 3 1 0.097036 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 360.0 0 0.0 0 0 500.0 1074.05 3 1 0.097036 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 590.0 0 0.0 0 0 350.0 814.37 3 1 0.097036 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 91 columns

4. Results¶

This section explores the feature selection process undertaken to identify the most predictive features for determining loan funding outcomes. Feature selection is pivotal as it enhances model performance by concentrating on the most informative variables, thereby reducing model complexity and potentially improving prediction accuracy.

4.1. Feature Selection¶

Feature selection involved a combination of univariate feature selection and RandomForest classifier's feature importance ranking. Despite the known effectiveness of Boruta and RFECV (Recursive Feature Elimination with Cross-Validation) in feature selection, these methods were not applied due to their computational demands. Given the large scale of the dataset and the extensive number of features, these methods' processing times were deemed prohibitive. Thus, the analysis proceeded with the quicker SelectKBest and RandomForest feature importance methods to ascertain relevant features for predicting loan funding success.

The insights from this section emphasize the critical role of feature selection in modeling, identifying key variables that may significantly influence the prediction of loan funding. Concentrating on the most informative features enables more focused and potentially more successful modeling efforts, establishing a groundwork for detailed analyses and robust predictive outcomes.
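Although RFECV was ruled out for the full dataset on computational grounds, it can be prototyped on a small synthetic sample to gauge its behavior before committing compute. The sketch below is illustrative only — the data, estimator, and scoring choice are assumptions, not the study's configuration:

```python
# Sketch: RFECV on a small synthetic dataset (too costly to run on the full data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy data: 500 samples, 20 features of which only 5 carry signal
X_small, y_small = make_classification(n_samples=500, n_features=20,
                                       n_informative=5, random_state=42)

# Recursively eliminate features, cross-validating at each step
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              cv=StratifiedKFold(5), scoring='f1')
rfecv.fit(X_small, y_small)

print("Optimal number of features:", rfecv.n_features_)
```

The same pattern would apply to the loan data, at a much higher cost: RFECV refits the estimator once per eliminated feature per fold, which is what makes it prohibitive at this dataset's scale.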

4.1.1. Training-Test Split & Feature Scaling¶

In predictive modeling, the dataset is first organized into features (X) and the target variable (y), distinguishing the data used for predictions from the outcome being predicted. For this task, the outcome of interest is GoodLoan, indicating whether a loan is classified as good based on its quality and repayment prospects.

A crucial step in the preparation process involves splitting the dataset into training and testing sets. This split is conducted with a 70:30 ratio, allocating 70% of the data for training and the remaining 30% for testing. The split below is random; stratifying it on the target (via train_test_split's stratify argument) would further ensure that both subsets reflect the overall distribution of the target variable, which is essential for training reliable models and evaluating their performance on unseen data.

Following the division, feature scaling is applied through StandardScaler, standardizing features to have a mean of zero and a unit variance. This normalization step is vital for algorithms sensitive to feature scale, ensuring equal contribution of all features to the model.

In [211]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Separate the dataset into features (X) and the target variable (y)
X = combined_df.drop('GoodLoan', axis=1)
y = combined_df['GoodLoan']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
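The split above is purely random. A stratified variant, sketched here on hypothetical toy data, guarantees that the class ratio is preserved in both subsets — worthwhile when, as here, the positive class is a small minority:

```python
# Sketch: stratified split on toy data (illustrative, not the loan dataset).
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 90 + [1] * 10)   # 10% positive class
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy forces train and test to keep the 10% positive rate
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())  # both 0.1
```

Without stratify, a random split can over- or under-sample the minority class in the test set, distorting recall and precision estimates.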

4.1.2. Univariate Feature Selection¶

The Univariate Feature Selection analysis, conducted using SelectKBest with the f_classif criterion, pinpointed the top 10 features with the greatest statistical significance for predicting the GoodLoan outcome. This method evaluates each feature individually, without considering interactions between them, to determine its impact on the target variable.

The highest-ranking feature, isFunded, stands out with an unparalleled score, suggesting its critical influence on loan quality assessments. Following it, features like approved and originated highlight the fundamental importance of the approval process and the initiation of a loan in forecasting loan performance. Early payment behaviors, as indicated by fpStatus_Checked and fpStatus_Rejected, are also identified as key predictors, emphasizing the relevance of initial payment outcomes.

The selection encompasses other important features such as installment_count, nameaddressmatch, and collection_false, which offer insights into repayment schedules, identity verification, and collection status. Additionally, inquiryonfilecurrentaddressconflict and overallmatchresult reflect the borrower's inquiry history and overall credit match, respectively. By isolating and ranking these features on their individual scores, univariate selection lays the groundwork for a predictive model that enhances loan quality predictions, supporting informed lending and risk management decisions.

In [212]:
select_k_best = SelectKBest(f_classif, k=10).fit(X_train_scaled, y_train)
selected_features_k_best = select_k_best.get_support(indices=True)

# Retrieve feature names
feature_names = X_train.columns

# Get the scores for each feature
feature_scores = select_k_best.scores_

# Combine indices of selected features with their names and scores
selected_features_info = [(feature_names[i], feature_scores[i]) for i in selected_features_k_best]

# Sort the selected features by score in descending order
selected_features_info_sorted = sorted(selected_features_info, key=lambda x: x[1], reverse=True)

selected_features_df = pd.DataFrame(selected_features_info_sorted, columns=['Feature Name', 'Score'])
selected_features_df.index += 1
selected_features_df
Out[212]:
Feature Name Score
1 isFunded 913861.106569
2 approved 836059.914003
3 originated 557521.194536
4 fpStatus_Checked 422776.971040
5 fpStatus_Rejected 88309.069451
6 installment_count 81485.746474
7 nameaddressmatch 79591.292430
8 collection_false 74399.031233
9 inquiryonfilecurrentaddressconflict 61660.711865
10 overallmatchresult 60815.895535

4.1.3. Feature Importance from RandomForest¶

Feature importances are extracted from the trained RandomForest model, sorted, and matched with feature names. The RandomForest Classifier reveals the top 10 features determining loan quality. The model underscores 'isFunded' as the paramount indicator, with 'approved' and 'originated' also crucial, reflecting key loan processing stages. 'good_pay' and 'fpStatus_Checked' follow, highlighting the borrower's payment history and status verification as significant. Repayment terms, captured by 'installmentDuration' and 'installment_count', along with 'clearfraudscore', which may assess fraud risk, are also influential. Features like 'collection_false' and 'days_to_originate' round out the top 10, indicating the importance of collection activity and the loan origination period. This analysis provides insight into the features most predictive of loan quality, offering guidance for risk management and lending decisions.

In [213]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train_scaled, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]

feature_names = X_train.columns

# Get sorted feature importances
sorted_idx = indices  # Or directly use: np.argsort(forest.feature_importances_)[::-1]
sorted_importances = importances[sorted_idx]

# Get sorted feature names for the top 10 features
sorted_features_top10 = [feature_names[i] for i in sorted_idx[:10]]

# Plot
plt.figure(figsize=(10, 6))
plt.barh(range(10), sorted_importances[:10], align='center')
plt.yticks(range(10), sorted_features_top10)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Top 10 Feature Importances from RandomForest')
plt.gca().invert_yaxis()  # Invert y-axis to have the most important feature at the top
plt.show()

4.1.4. Key Features¶

The intersection of features identified by SelectKBest and the RandomForest Classifier has yielded a concise list of variables that hold significant predictive power for the GoodLoan target variable. The common features include 'collection_false', 'originated', 'approved', 'isFunded', 'fpStatus_Checked', and 'installment_count'. These features provide a comprehensive overview of factors influencing the likelihood of a loan being considered 'Good'.

The 'isFunded' feature, shared by both feature selection methods, indicates the fundamental importance of the funding status in the definition of a 'Good' loan. A funded loan is presumably one that has passed initial credit assessments, making it a prime indicator of loan quality. The 'approved' status further confirms that the loan has met all the necessary criteria set by the lender, underscoring its quality. The 'originated' status signifies the commencement of the loan agreement, reflecting the lender's confidence in the borrower's ability to repay.

The 'fpStatus_Checked' feature suggests that the initial payment check is a critical step in ensuring the borrower's reliability and thus influences the loan's classification as 'Good'. Similarly, 'installment_count' offers insights into the repayment structure, with a higher number of installments potentially indicating manageable repayment plans and a lower risk profile. Conversely, 'collection_false' indicates loans not in collection, typically associated with lower risk and better loan performance. Together, these features present a multifaceted picture of loan quality, with each element contributing to the holistic assessment of what constitutes a 'GoodLoan'. This alignment between the analytical findings and the practical aspects of loan approval and management underscores the robustness of the model and its relevance for real-world applications in the lending industry.

In [214]:
common_features = set(selected_features_k_best).intersection(indices[:10])
# Convert the set of common features to a list to use it for indexing
common_features_list = list(common_features)

# Filter the list of feature names to only include those identified as common
features = combined_df.drop('GoodLoan', axis=1).columns.tolist()
common_features_names = [features[i] for i in common_features_list]
common_features_names
Out[214]:
['collection_false',
 'originated',
 'approved',
 'isFunded',
 'fpStatus_Checked',
 'installment_count']

4.2. Predictive Analysis¶

For evaluating machine learning models, an evaluation function was created to calculate and display metrics such as accuracy, confusion matrix, precision, recall, and F1 score. This function standardizes the evaluation process across different models, ensuring a consistent approach to performance measurement.

In [215]:
# Create subsets of the training and testing data with only the common features
X_train_common = X_train_scaled[:, common_features_list]
X_test_common = X_test_scaled[:, common_features_list]

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
import pandas as pd

# Function to calculate and display evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    acc = accuracy_score(y_true, y_pred)
    conf_matrix = confusion_matrix(y_true, y_pred)
    cm = pd.DataFrame(conf_matrix,
                      index=['Actual Negative', 'Actual Positive'],
                      columns=['Predicted Negative', 'Predicted Positive'])
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print(f"---{model_name}---")
    print(f"Confusion Matrix:\n{cm}\n")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}\n")

4.2.1. Logistic Regression¶

The Logistic Regression model, after being trained on the common feature set, exhibited high accuracy at 98.00% and an almost perfect recall of 99.98%, which indicates it was highly successful in identifying 'Good' loans. However, its precision of 70.42% suggests that it was less discerning in its positive predictions, leading to a substantial number of False Positives, as reflected by the confusion matrix.

In [216]:
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_common, y_train)

# Predict and evaluate
y_pred_log_reg = log_reg.predict(X_test_common)
evaluate_model(y_test, y_pred_log_reg, "Logistic Regression")
---Logistic Regression---
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative              161610                3459
Actual Positive                   2                8234

Accuracy: 0.9800
Precision: 0.7042
Recall: 0.9998
F1 Score: 0.8263

4.2.2. Decision Tree Classifier¶

The Decision Tree classifier yielded an improved performance with a precision of 78.57% and an impressive recall of 91.06%, achieving an overall accuracy of 98.39%. The higher precision indicates that the Decision Tree was more accurate in predicting 'Good' loans compared to Logistic Regression, albeit with a slight decrease in recall.

In [217]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train_common, y_train)

# Predict and evaluate
y_pred_dt = decision_tree.predict(X_test_common)
evaluate_model(y_test, y_pred_dt, "Decision Tree")
---Decision Tree---
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative              163023                2046
Actual Positive                 736                7500

Accuracy: 0.9839
Precision: 0.7857
Recall: 0.9106
F1 Score: 0.8435

4.2.3. Random Forest Model¶

Random Forest, an ensemble method, slightly outperformed the individual Decision Tree, notching a 98.41% accuracy and a recall rate of 91.46%. Its precision was comparable to the Decision Tree at 78.53%. The Random Forest's performance suggests that the ensemble approach provided a marginal improvement over the single Decision Tree, particularly in accurately identifying 'Good' loans.

In [218]:
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train_common, y_train)

# Predict and evaluate
y_pred_rf = random_forest.predict(X_test_common)
evaluate_model(y_test, y_pred_rf, "Random Forest")
---Random Forest---
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative              163009                2060
Actual Positive                 703                7533

Accuracy: 0.9841
Precision: 0.7853
Recall: 0.9146
F1 Score: 0.8450

4.2.4. XGBoost Classifier¶

XGBoost Classification stood out with the highest accuracy at 98.42% and the best F1 Score of all models at 84.63%, which is a measure of the model's balance between precision and recall. Its recall of 91.82% and precision of 78.49% indicate that XGBoost was effective in predicting 'Good' loans and maintaining a balance between identifying as many 'Good' loans as possible while minimizing false positives.

In [219]:
from xgboost import XGBClassifier

# Initialize and train the XGBoost model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb_model.fit(X_train_common, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test_common)
evaluate_model(y_test, y_pred_xgb, "XGBoost Classification")
---XGBoost Classification---
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative              162997                2072
Actual Positive                 674                7562

Accuracy: 0.9842
Precision: 0.7849
Recall: 0.9182
F1 Score: 0.8463

4.2.5. Model Evaluation¶

The comparative model evaluation across four different algorithms—Logistic Regression, Decision Tree, Random Forest, and XGBoost—demonstrates a close competition in terms of performance metrics.

Logistic Regression exhibited a commendable accuracy of 98.00% and the highest recall among all models at approximately 99.98%, indicating its exceptional ability to identify 'Good' loans. However, its precision score of 70.42% was the lowest, suggesting a tendency to falsely label loans as 'Good' more often than other models. The F1 Score, which balances precision and recall, was decent at 82.63%.

Decision Tree showed an improved precision of 78.57% over Logistic Regression, which indicates better discernment in classifying loans as 'Good'. The accuracy also increased to 98.39%, and the recall remained high at 91.06%. The F1 Score, at 84.35%, reflects a balanced trade-off between precision and recall, indicating a solid overall performance.

Random Forest slightly edged out the Decision Tree with an accuracy of 98.41%, showing a marginal improvement. Precision was roughly on par with the Decision Tree, but Random Forest achieved a slightly higher recall of 91.46%, suggesting a minor enhancement in identifying 'Good' loans. Its F1 Score was comparable at 84.50%.

XGBoost topped the accuracy metric at 98.42%, albeit by a slim margin. Its precision was similar to the Random Forest model, but it boasted the highest recall after Logistic Regression at 91.82% and the best F1 Score at 84.63%, marking it as the most balanced model in terms of both identifying and correctly classifying 'Good' loans.

In summary, while Logistic Regression had the highest recall, indicating a strong ability to detect 'Good' loans, it also had a higher false positive rate. The Decision Tree and Random Forest provided more precision with a slight trade-off in recall. XGBoost achieved the best balance between all metrics, making it a strong candidate for scenarios where both high precision and recall are valued for identifying loan quality. The choice of model would ultimately depend on the specific business context and the cost-benefit analysis of false positives versus false negatives in loan prediction scenarios.
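One practical lever in such a cost-benefit analysis is the classification threshold: rather than cutting predict_proba scores at the default 0.5, the threshold can be chosen to penalize the costlier error. The sketch below uses synthetic data purely to illustrate the trade-off; all data and parameter choices are assumptions, not results from this study:

```python
# Sketch: moving the decision threshold trades false positives for false negatives.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy problem (~10% positives), loosely mirroring the GoodLoan setup
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Raising the threshold reduces false positives but increases false negatives
results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    fp = int(((pred == 1) & (y_te == 0)).sum())
    fn = int(((pred == 0) & (y_te == 1)).sum())
    results[threshold] = (fp, fn)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

In a lending context, the threshold would be set by comparing the dollar cost of a default (false positive) against the foregone revenue of a rejected good borrower (false negative).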

In [220]:
metrics_df = pd.DataFrame({
    'Accuracy': [accuracy_score(y_test, y_pred_log_reg), accuracy_score(y_test, y_pred_dt), accuracy_score(y_test, y_pred_rf),accuracy_score(y_test, y_pred_xgb)],
    'Precision': [precision_score(y_test, y_pred_log_reg), precision_score(y_test, y_pred_dt), precision_score(y_test, y_pred_rf),precision_score(y_test, y_pred_xgb)],
    'Recall': [recall_score(y_test, y_pred_log_reg), recall_score(y_test, y_pred_dt), recall_score(y_test, y_pred_rf),recall_score(y_test, y_pred_xgb)],
    'F1 Score': [f1_score(y_test, y_pred_log_reg), f1_score(y_test, y_pred_dt), f1_score(y_test, y_pred_rf), f1_score(y_test, y_pred_xgb)]
}, index=['Logistic Regression', 'Decision Tree', 'Random Forest', 'XGBoost'])

def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: green' if v else '' for v in is_max]

styled_metrics_df = metrics_df.style.apply(highlight_max, subset=['Accuracy', 'Precision', 'Recall', 'F1 Score'])
styled_metrics_df
Out[220]:
  Accuracy Precision Recall F1 Score
Logistic Regression 0.980029 0.704182 0.999757 0.826333
Decision Tree 0.983947 0.785669 0.910636 0.843550
Random Forest 0.984057 0.785260 0.914643 0.845028
XGBoost 0.984155 0.784928 0.918164 0.846335
In [221]:
true_positive_gain = 2000  # Gain from a correct prediction
false_positive_cost = 100  # Cost from a false positive prediction
false_negative_cost = 1500  # Cost from a false negative prediction
In [222]:
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
In [223]:
TP_lr = conf_matrix_log_reg[1, 1]
FP_lr = conf_matrix_log_reg[0, 1]
FN_lr = conf_matrix_log_reg[1, 0]
In [224]:
tp_value_lr = TP_lr * true_positive_gain
fp_cost_lr = FP_lr * false_positive_cost
fn_cost_lr = FN_lr * false_negative_cost
business_metric_lr = tp_value_lr - (fp_cost_lr + fn_cost_lr)

print(f"Business Metric for Logistic Regression: ${business_metric_lr}")
Business Metric for Logistic Regression: $16119100
In [225]:
# Confusion matrix for the Random Forest predictions on the test set
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
In [226]:
TP_rf = conf_matrix_rf[1, 1]
FP_rf = conf_matrix_rf[0, 1]
FN_rf = conf_matrix_rf[1, 0]
In [227]:
tp_value_rf = TP_rf * true_positive_gain
fp_cost_rf = FP_rf * false_positive_cost
fn_cost_rf = FN_rf * false_negative_cost
business_metric_rf = tp_value_rf - (fp_cost_rf + fn_cost_rf)

print(f"Business Metric for Random Forest: ${business_metric_rf}")
Business Metric for Random Forest: $13805500

4.3. Business Application - Projected Loan Profitability (PLP)¶

4.3.1. Business Metrics¶

In evaluating the financial impact of predictive models on loan portfolio management, the Projected Loan Profitability (PLP) serves as a crucial metric. The PLP quantifies the net economic benefit of using a model to predict 'Good Loans,' factoring in the gains from correctly identified loans and the costs associated with misclassifications. This analysis uses the PLP to compare four models: Logistic Regression, Decision Tree, Random Forest, and XGBoost.

The PLP is calculated by taking the interest revenue from correctly predicted 'Good Loans' (true positives) and subtracting the costs of both kinds of misclassification: loans wrongly labelled 'Good' (false positives) and 'Good Loans' the model fails to identify (false negatives).

PLP calculation involves:

  • Revenue from True Positives: interest revenue expected from loans correctly identified as 'Good', with an assumed value of $2,000 per loan.

  • Cost of False Positives: the processing cost charged for each loan erroneously classified as 'Good', set at $100 per loan (the operational_cost_per_loan used in the code).

  • Cost of False Negatives: the loss attributed to each 'Good Loan' the model fails to identify, set at $1,500 per loan (the loss_per_defaulted_loan used in the code).

The financial figures utilized in the PLP calculations—interest revenue, loss from defaults, and operational costs—are hypothetical values. These have been chosen to illustrate the PLP metric's application in a business context and should be adjusted according to real-world data and institutional norms when applied practically.
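With these assumptions the metric reduces to PLP = TP × $2,000 − FP × $100 − FN × $1,500, mirroring the cost assignment in the code cells of this section. A short worked example with made-up confusion-matrix counts:

```python
# Made-up confusion-matrix counts, chosen only to illustrate the arithmetic
tp, fp, fn = 8000, 300, 200

interest_revenue_per_good_loan = 2000   # gain per true positive
operational_cost_per_loan = 100         # cost per false positive
loss_per_defaulted_loan = 1500          # cost per false negative

plp = (tp * interest_revenue_per_good_loan
       - fp * operational_cost_per_loan
       - fn * loss_per_defaulted_loan)
print(f"PLP: ${plp:,}")  # → PLP: $15,670,000
```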

In [228]:
# Business values
interest_revenue_per_good_loan = 2000  # Estimated interest revenue from a 'Good Loan'
loss_per_defaulted_loan = 1500  # Estimated loss from a loan that defaults
operational_cost_per_loan = 100  # Operational costs for processing a loan application

4.3.2. Model Performance¶

PLP values for each model provide insight into their respective financial impacts based on their predictions of loan repayment outcomes. The PLP quantifies the net benefit, combining the revenue from interest on accurately predicted 'Good Loans' and subtracting losses from defaults and operational costs associated with incorrect predictions. Here's an interpretation of the outcomes for each model:

  • Logistic Regression has the highest PLP at $16,119,100. Its near-perfect recall (0.9998) means it produces almost no false negatives, and because a false negative ($1,500) is assumed far costlier than a false positive ($100), this recall-heavy profile maximizes the metric despite the model's comparatively low precision (0.7042). In this context, its predictions are the ones most aligned with maximizing revenue and minimizing losses.

  • Decision Tree shows a PLP of $13,691,400, the lowest among the four models. Its recall (0.9106) is also the lowest of the four, so it incurs the most false negatives, the costliest error type under these assumptions, and that penalty outweighs its higher precision relative to Logistic Regression.

  • Random Forest yields a PLP of $13,805,500, slightly above the Decision Tree, indicating a marginal improvement in predicting profitable loans and managing risks. The ensemble nature of the Random Forest, which aggregates predictions from multiple decision trees, likely contributes to this improved financial outcome by better capturing complex patterns in the data.

  • XGBoost has a PLP of $13,905,800, positioning it above both the Decision Tree and Random Forest but below Logistic Regression. The slight improvement in PLP over the Random Forest suggests that XGBoost's more advanced gradient boosting framework offers benefits in handling the nuances of loan repayment predictions, thus enhancing profitability.

In [229]:
# Define the function to calculate the Projected Loan Profitability (PLP) for a given model
def calculate_plp(conf_matrix, interest_revenue_per_good_loan, loss_per_defaulted_loan, operational_cost_per_loan):
    tp_value = conf_matrix[1, 1] * interest_revenue_per_good_loan
    fp_cost = conf_matrix[0, 1] * operational_cost_per_loan
    fn_cost = conf_matrix[1, 0] * loss_per_defaulted_loan
    plp = tp_value - (fp_cost + fn_cost)
    return plp

# Confusion matrices for each model
conf_matrix_log_reg = confusion_matrix(y_test, y_pred_log_reg)
conf_matrix_dt = confusion_matrix(y_test, y_pred_dt)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)

# Calculate PLP for each model
plp_log_reg = calculate_plp(conf_matrix_log_reg, interest_revenue_per_good_loan, loss_per_defaulted_loan, operational_cost_per_loan)
plp_dt = calculate_plp(conf_matrix_dt, interest_revenue_per_good_loan, loss_per_defaulted_loan, operational_cost_per_loan)
plp_rf = calculate_plp(conf_matrix_rf, interest_revenue_per_good_loan, loss_per_defaulted_loan, operational_cost_per_loan)
plp_xgb = calculate_plp(conf_matrix_xgb, interest_revenue_per_good_loan, loss_per_defaulted_loan, operational_cost_per_loan)

# Create a DataFrame to display PLP for each model
plp_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'XGBoost'],
    'PLP': [plp_log_reg, plp_dt, plp_rf, plp_xgb]
})

plp_df_sorted = plp_df.sort_values('PLP', ascending=False).reset_index(drop=True)
plp_df_sorted.index = range(1, len(plp_df_sorted) + 1)
plp_df_sorted
Out[229]:
Model PLP
1 Logistic Regression 16119100
2 XGBoost 13905800
3 Random Forest 13805500
4 Decision Tree 13691400

4.3.3. Interpretation¶

The outcome reveals that the Logistic Regression model is the most beneficial from a financial perspective. Its pronounced tilt toward recall over precision is rewarded by the assumed cost structure, in which a missed 'Good Loan' ($1,500) costs fifteen times as much as a false positive ($100). While ensemble methods like Random Forest and XGBoost improve on a single Decision Tree, they do not outperform Logistic Regression in overall profitability as measured by PLP.

This analysis underscores the importance of considering not just traditional model evaluation metrics (accuracy, precision, recall, F1 score) but also business-specific outcomes like PLP when selecting models for deployment. A model's ability to support strategic business objectives—here, maximizing loan profitability—should guide the final selection for operational use.
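A practical extension of this idea is to tune the classification threshold against PLP itself rather than against accuracy or F1. The sketch below is illustrative only: the labels and probabilities are made up, standing in for what y_test and a fitted model's predict_proba(X_test)[:, 1] would supply in this notebook.

```python
# Made-up labels and predicted positive-class probabilities
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.95, 0.80, 0.65, 0.60, 0.55, 0.40, 0.90, 0.30, 0.20, 0.70]

def plp_at_threshold(y_true, y_prob, threshold,
                     tp_gain=2000, fp_cost=100, fn_cost=1500):
    """PLP obtained when predicting positive for probabilities >= threshold."""
    tp = sum(t == 1 and p >= threshold for t, p in zip(y_true, y_prob))
    fp = sum(t == 0 and p >= threshold for t, p in zip(y_true, y_prob))
    fn = sum(t == 1 and p < threshold for t, p in zip(y_true, y_prob))
    return tp * tp_gain - fp * fp_cost - fn * fn_cost

# Sweep candidate thresholds and keep the most profitable one
thresholds = [t / 100 for t in range(5, 100, 5)]
best = max(thresholds, key=lambda t: plp_at_threshold(y_true, y_prob, t))
print(best, plp_at_threshold(y_true, y_prob, best))  # → 0.25 13800
```

Because a missed 'Good Loan' is assumed fifteen times costlier than a false positive, the sweep favours a low threshold here; with real cost figures the optimum would shift.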

5. Conclusion¶

The analysis conducted to evaluate the Projected Loan Profitability (PLP) of four predictive models—Logistic Regression, Decision Tree, Random Forest, and XGBoost—offers insightful revelations about the financial impact of utilizing machine learning in loan portfolio management. Logistic Regression emerged as the most profitable model based on PLP, highlighting its potential for maximizing revenue through accurate 'Good Loan' predictions while minimizing losses and operational costs associated with misclassifications.

5.1. Recommendations¶

Model Selection: Financial institutions should consider adopting Logistic Regression for loan approval processes, given its superior PLP performance. However, the choice of model should also account for the institution's specific risk tolerance, operational flexibility, and strategic objectives.

Model Calibration: Regular calibration of the chosen model is recommended to ensure its continued alignment with changing market conditions, borrower behavior, and economic factors. This includes adjusting the hypothetical values used in PLP calculations to reflect current realities.
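Such recalibration can be rehearsed offline by recomputing PLP under alternative cost assumptions and checking whether the model ranking holds. The counts below are made up for illustration; in this notebook they would come from the confusion matrices already computed (conf_matrix_log_reg and its peers).

```python
# Made-up confusion-matrix counts for two hypothetical models
models = {
    "Model A (high recall)": {"tp": 8200, "fp": 900, "fn": 50},
    "Model B (balanced)":    {"tp": 7600, "fp": 250, "fn": 650},
}

def plp(counts, tp_gain=2000, fp_cost=100, fn_cost=1500):
    return counts["tp"] * tp_gain - counts["fp"] * fp_cost - counts["fn"] * fn_cost

# Escalate the assumed cost of a false positive and re-rank the models
for fp_cost in (100, 2000, 5000):
    ranked = sorted(models, key=lambda m: plp(models[m], fp_cost=fp_cost), reverse=True)
    print(f"fp_cost=${fp_cost}: best = {ranked[0]}")
```

Under these invented counts the ranking flips once a false positive is assumed expensive enough, which is exactly why the hypothetical figures should be revisited as conditions change.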

Feature Analysis: Further analysis of the features contributing to model predictions can provide deeper insights into factors influencing loan repayment. Institutions may benefit from refining these features or exploring additional data sources to enhance model accuracy.
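As a starting point for such feature analysis, a logistic regression's coefficients can be ranked by absolute magnitude (meaningful only if the features were standardized), while tree models expose feature_importances_. The feature names and coefficients below are invented for illustration; in this notebook they would come from the fitted model, e.g. by pairing the training columns with the model's coef_[0].

```python
# Invented coefficients standing in for a fitted logistic regression's
coefficients = {
    "int_rate":   -1.8,  # hypothetical: higher rate, lower odds of the positive class
    "annual_inc":  0.9,
    "dti":        -0.7,
    "loan_amnt":  -0.3,
}

# Rank features by absolute coefficient size
ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>10}: {coef:+.1f}")
```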

5.2. Limitations¶

Hypothetical Values: The PLP calculations rely on hypothetical financial figures, which may not accurately reflect all lending environments. Institutions should customize these values based on their historical data and market analysis to ensure relevance.

Model Complexity: While Logistic Regression showed the highest PLP, more complex models like XGBoost were close competitors. The simplicity of Logistic Regression offers ease of interpretation and implementation, but complex models may capture nuanced patterns that could prove beneficial in certain contexts.

Generalization: The findings are based on a specific dataset and set of assumptions. The performance of models may vary with different datasets, target definitions, and business contexts.

5.3. Concluding Remark¶

This study underscores the value of integrating predictive modeling into loan portfolio management to enhance profitability and manage risks effectively. By strategically leveraging models like Logistic Regression, financial institutions can make informed lending decisions that contribute positively to their bottom line. However, the dynamic nature of financial markets necessitates ongoing model evaluation and adjustment to maintain accuracy and relevance. Future research could explore the integration of additional predictive factors, alternative modeling techniques, and real-world application case studies to further enrich understanding and application of PLP in loan management.